06 Contextual Policy Learning

This is the advanced modeling notebook in the Off-Policy Evaluation of Recommendation Systems project.

The earlier notebooks evaluated simple policies that assigned the same item probabilities for every user/context. That was useful for learning IPS, SNIPS, direct method, and doubly robust OPE, but real recommendation systems are contextual. A production recommendation policy should change its item choices based on user features, time, position, and item-user affinity signals.

This notebook asks:

Can we learn a context-aware recommendation policy from logged data, then evaluate it offline with OPE?

We train reward models, score every held-out context against every candidate item, convert those scores into contextual policies, and evaluate those policies with IPS, SNIPS, direct method, and doubly robust estimation.

Why Contextual Policy Learning?

A fixed policy says, for example, “recommend item 7 with 20% probability for everyone.” That is easy to evaluate, but it ignores personalization.

A contextual policy says, “for this user and this context, allocate probability toward the items with the highest predicted rewards.” This is closer to real recommendation systems because the policy can adapt to:

  • user features
  • time of day
  • recommendation position
  • item metadata
  • user-item affinity signals

The danger is that learned contextual policies can become aggressive. A fully greedy policy may put all probability on one predicted-best item per context. If the behavior policy rarely logged that item for similar contexts, IPS and DR corrections can become noisy. This notebook therefore compares aggressive and conservative learned policies.
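To make the weight risk concrete, here is a minimal sketch with synthetic numbers (not computed from this notebook's data): under a uniform behavior policy over 34 items, a greedy target policy turns the few matching rows into large importance weights and discards the rest.

import numpy as np

# Toy illustration of IPS weight blow-up under a greedy target policy.
n_actions = 34
behavior_prob = 1 / n_actions   # uniform logging policy, as in this dataset
greedy_prob = 1.0               # greedy policy puts all mass on one item

# Rows where the logged action matches the greedy choice get this weight;
# every other row gets weight zero and drops out of the IPS average.
matched_weight = greedy_prob / behavior_prob
print(matched_weight)           # 34.0: each surviving row is scaled 34x

Only about 1 in 34 logged rows survives the match, so the estimate rests on a small, heavily upweighted slice of the log.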

Notebook Setup

This cell imports the libraries used for reward modeling, policy learning, OPE, and plotting.

It also suppresses one known LightGBM/sklearn metadata warning about feature names. That warning appears when LightGBM is used inside a scikit-learn pipeline and does not indicate a modeling problem. The filter is targeted so other warnings remain visible.

from pathlib import Path
import warnings

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, log_loss, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pd.set_option("display.max_columns", 140)
pd.set_option("display.max_rows", 140)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names, but LGBMClassifier was fitted with feature names",
    category=UserWarning,
)

This cell prepares the notebook environment for contextual policy learning with offline evaluation. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are ready for the OPE diagnostics that follow.

Load The Random Open Bandit Sample

This cell loads the same random/men sample used throughout the OPE project.

The random behavior policy is valuable here because it gives broad action support. Contextual policies can be more concentrated than fixed policies, so a randomized log is the safest place to evaluate them offline.

RANDOM_SAMPLE_RELATIVE_PATH = Path("data/processed/open_bandit_random_men_sample.parquet")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / RANDOM_SAMPLE_RELATIVE_PATH).exists()
)

RANDOM_SAMPLE_PATH = PROJECT_ROOT / RANDOM_SAMPLE_RELATIVE_PATH
random_df = pd.read_parquet(RANDOM_SAMPLE_PATH).sort_values("timestamp").reset_index(drop=True)

pd.Series(
    {
        "project_root": PROJECT_ROOT,
        "random_sample_path": RANDOM_SAMPLE_PATH,
        "rows": len(random_df),
        "columns": random_df.shape[1],
        "observed_click_rate": random_df["click"].mean(),
    }
).to_frame("value")
value
project_root /home/apex/Documents/ranking_sys
random_sample_path /home/apex/Documents/ranking_sys/data/processe...
rows 200000
columns 50
observed_click_rate 0.005190

The loaded table shape and preview confirm that the expected cached data is available. This check matters because all later OPE estimates depend on using the correct logged actions, rewards, contexts, and behavior propensities.

Create Train And Evaluation Splits

This notebook uses the same primary time split as the previous OPE notebooks.

The training split is used to fit reward models and define learned policies. The evaluation split is used for off-policy evaluation. This keeps policy learning separate from policy evaluation and reduces the risk of overclaiming from in-sample performance.

SPLIT_FRACTION = 0.50
split_idx = int(len(random_df) * SPLIT_FRACTION)

train_df = random_df.iloc[:split_idx].copy()
eval_df = random_df.iloc[split_idx:].copy()

split_summary = pd.DataFrame(
    {
        "split": ["train", "evaluation"],
        "rows": [len(train_df), len(eval_df)],
        "min_timestamp": [train_df["timestamp"].min(), eval_df["timestamp"].min()],
        "max_timestamp": [train_df["timestamp"].max(), eval_df["timestamp"].max()],
        "click_rate": [train_df["click"].mean(), eval_df["click"].mean()],
    }
)

split_summary
split rows min_timestamp max_timestamp click_rate
0 train 100000 2019-11-24 00:00:03.800821+00:00 2019-11-25 10:01:18.392921+00:00 0.005400
1 evaluation 100000 2019-11-25 10:01:18.393450+00:00 2019-11-27 02:50:16.027289+00:00 0.004980

The split separates policy construction from policy evaluation. This prevents using the same rows to design a policy and evaluate it, which would make the offline result too optimistic.

Define Feature Groups And Action Space

This cell defines the action space and model feature groups.

The action is item_id. The reward model uses user features, item features, position, hour, and a selected affinity feature. The selected affinity feature is built by choosing the user-item affinity column that corresponds to the candidate item being scored.

action_space = np.array(sorted(random_df["item_id"].unique()))
n_actions = len(action_space)

action_to_index = {item_id: idx for idx, item_id in enumerate(action_space)}
user_feature_cols = [col for col in random_df.columns if col.startswith("user_feature_")]
affinity_cols_by_action = [f"user-item_affinity_{item_id}" for item_id in action_space]
item_feature_cols = [col for col in random_df.columns if col.startswith("item_feature_")]

categorical_features = [
    "position",
    "hour",
    "item_id",
    *user_feature_cols,
    "item_feature_1",
    "item_feature_2",
    "item_feature_3",
]
numeric_features = ["selected_affinity", "item_feature_0"]
feature_cols = categorical_features + numeric_features

feature_summary = pd.DataFrame(
    {
        "group": ["actions", "user features", "affinity columns", "item features", "model categorical", "model numeric"],
        "count": [n_actions, len(user_feature_cols), len(affinity_cols_by_action), len(item_feature_cols), len(categorical_features), len(numeric_features)],
    }
)

feature_summary
group count
0 actions 34
1 user features 4
2 affinity columns 34
3 item features 4
4 model categorical 10
5 model numeric 2

The action-space definition fixes the set of items that evaluation policies are allowed to choose. OPE estimates are only meaningful for policies whose action probabilities live inside this logged support.
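As a quick guard (a sketch, not part of the original pipeline), we can assert that every candidate action has logged support in both splits and that the behavior propensities used later are strictly positive:

# Support sanity check: every candidate action should appear in both splits,
# and the logged behavior propensities must be strictly positive.
for split_name, split_df in [("train", train_df), ("evaluation", eval_df)]:
    missing = sorted(set(action_space) - set(split_df["item_id"].unique()))
    if missing:
        raise ValueError(f"{split_name} split lacks logged support for: {missing}")

assert (random_df["propensity_score"] > 0).all(), "found zero behavior propensity"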

Build Item Metadata Lookup

The cached sample contains item metadata for the logged item. To score counterfactual candidate items, we need item metadata for every candidate item.

This cell builds a one-row-per-item lookup table. When we score candidate item a, we attach the metadata for a, regardless of which item was actually logged in that row.

item_context = (
    random_df[["item_id", *item_feature_cols]]
    .drop_duplicates("item_id")
    .set_index("item_id")
    .sort_index()
)

missing_items = sorted(set(action_space) - set(item_context.index))
if missing_items:
    raise ValueError(f"Missing item metadata for item IDs: {missing_items}")

item_context.head()
item_feature_0 item_feature_1 item_feature_2 item_feature_3
item_id
0 -0.677183 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
1 -0.720300 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
2 0.745662 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
3 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d
4 1.651109 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759

These feature-building cells define the context used by reward models. Reward models need both user context and candidate item context so they can predict counterfactual rewards for actions that were not logged.

Define Feature Builders

This cell defines two feature builders.

make_logged_feature_frame creates features for the item actually shown in each logged row. This is used to train and evaluate reward models on observed outcomes.

make_candidate_feature_frame creates features for every candidate item in each context. This is used to learn contextual policies and compute direct-method components.

def make_logged_feature_frame(df):
    """Create model features for the logged action in each row."""
    frame = df.copy()
    affinity_matrix = frame[affinity_cols_by_action].to_numpy()
    item_positions = frame["item_id"].map(action_to_index).to_numpy()
    frame["selected_affinity"] = affinity_matrix[np.arange(len(frame)), item_positions]
    return frame[feature_cols]


def make_candidate_feature_frame(context_df):
    """Create model features for every candidate action in each context row."""
    n_contexts = len(context_df)
    tiled_actions = np.tile(action_space, n_contexts)

    frame = pd.DataFrame(
        {
            "position": np.repeat(context_df["position"].to_numpy(), n_actions),
            "hour": np.repeat(context_df["hour"].to_numpy(), n_actions),
            "item_id": tiled_actions,
        }
    )

    for col in user_feature_cols:
        frame[col] = np.repeat(context_df[col].to_numpy(), n_actions)

    affinity_matrix = context_df[affinity_cols_by_action].to_numpy()
    frame["selected_affinity"] = affinity_matrix.reshape(-1)

    repeated_item_context = item_context.loc[tiled_actions, item_feature_cols].reset_index(drop=True)
    frame = pd.concat([frame, repeated_item_context], axis=1)

    return frame[feature_cols]

make_candidate_feature_frame(eval_df.head(2))
position hour item_id user_feature_0 user_feature_1 user_feature_2 user_feature_3 item_feature_1 item_feature_2 item_feature_3 selected_affinity item_feature_0
0 2 10 0 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
1 2 10 1 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.720300
2 2 10 2 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.745662
...
67 2 10 33 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.612508

[68 rows x 12 columns: 2 evaluation contexts x 34 candidate actions. User features repeat within each block of 34 rows; item_id, item features, and selected_affinity vary by candidate.]

The two builders construct features the same way for logged actions and for counterfactual candidates. That symmetry matters: a reward model trained on logged rows can then score candidate rows without feature skew between training and counterfactual prediction.

Train Reward Models

This cell trains two reward models:

  • logistic regression as a transparent baseline
  • LightGBM as the advanced nonlinear model used for contextual policy learning

The LightGBM model can capture nonlinear interactions between user features, item features, position, and affinity. That makes it a reasonable advanced model for learning personalized item scores.

X_train = make_logged_feature_frame(train_df)
y_train = train_df["click"].astype(int)
X_eval_logged = make_logged_feature_frame(eval_df)
y_eval = eval_df["click"].astype(int)


def build_preprocessor():
    return ColumnTransformer(
        transformers=[
            ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
            ("numeric", StandardScaler(), numeric_features),
        ],
        remainder="drop",
    )

reward_models = {
    "logistic": Pipeline(
        steps=[
            ("preprocess", build_preprocessor()),
            ("model", LogisticRegression(max_iter=500, solver="lbfgs")),
        ]
    ),
    "lightgbm": Pipeline(
        steps=[
            ("preprocess", build_preprocessor()),
            (
                "model",
                lgb.LGBMClassifier(
                    n_estimators=220,
                    learning_rate=0.05,
                    num_leaves=31,
                    min_child_samples=100,
                    subsample=0.85,
                    colsample_bytree=0.85,
                    random_state=42,
                    verbose=-1,
                ),
            ),
        ]
    ),
}

for model_name, model in reward_models.items():
    model.fit(X_train, y_train)

list(reward_models)
['logistic', 'lightgbm']

The reward models learn expected click probability from logged action-context pairs. These predictions power the direct method and the model-based part of the doubly robust estimator.

Evaluate Reward Model Quality

Before using a reward model to define policies, we check whether it has predictive signal on held-out logged actions.

AUC and average precision measure ranking quality. Log loss and Brier score measure probability quality. For contextual policy learning, we care about both: the policy ranks candidate items by predicted reward, and DR uses predicted probabilities in the direct component and residual correction.

logged_predictions = {}
model_metric_rows = []

for model_name, model in reward_models.items():
    pred = model.predict_proba(X_eval_logged)[:, 1]
    logged_predictions[model_name] = pred
    model_metric_rows.append(
        {
            "reward_model": model_name,
            "auc": roc_auc_score(y_eval, pred),
            "average_precision": average_precision_score(y_eval, pred),
            "log_loss": log_loss(y_eval, pred, labels=[0, 1]),
            "brier_score": brier_score_loss(y_eval, pred),
            "mean_prediction": pred.mean(),
            "observed_click_rate": y_eval.mean(),
        }
    )

reward_model_metrics = pd.DataFrame(model_metric_rows).sort_values("log_loss")
reward_model_metrics
reward_model auc average_precision log_loss brier_score mean_prediction observed_click_rate
0 logistic 0.539928 0.006395 0.032528 0.005081 0.005774 0.004980
1 lightgbm 0.531891 0.007452 0.034658 0.005114 0.004844 0.004980

The reward-model diagnostics show whether predicted click probabilities align with observed clicks. Calibration matters because DR uses these predictions as counterfactual baselines before applying residual correction.
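A toy sketch with synthetic numbers shows the mechanism: when the importance weights are well behaved, the DR residual term pulls a miscalibrated direct-method baseline back toward the truth.

rng = np.random.default_rng(0)
n = 200_000
true_rate = 0.01
q_hat_biased = 0.02                          # model predicts 2x the true rate
reward = rng.binomial(1, true_rate, size=n).astype(float)
weight = np.ones(n)                          # idealized case: correct weights

dm_estimate = q_hat_biased                   # DM trusts the biased model
dr_estimate = q_hat_biased + (weight * (reward - q_hat_biased)).mean()

print(f"DM {dm_estimate:.4f} | DR {dr_estimate:.4f} | truth {true_rate:.4f}")
# DM stays at 0.02; DR's residual correction recovers roughly 0.01.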

Score Every Candidate Item With LightGBM

This cell uses the LightGBM reward model to predict click probability for every (context, candidate item) pair in the evaluation split.

The result is a matrix with one row per evaluation context and one column per candidate item. This matrix is the raw material for contextual policy learning: each row tells us which items look best for that specific context.

def predict_candidate_rewards(model, contexts_df, batch_size=12_000):
    prediction_batches = []
    for start in range(0, len(contexts_df), batch_size):
        context_batch = contexts_df.iloc[start : start + batch_size]
        candidate_features = make_candidate_feature_frame(context_batch)
        q_hat = model.predict_proba(candidate_features)[:, 1].reshape(len(context_batch), n_actions)
        prediction_batches.append(q_hat)
    return np.vstack(prediction_batches)

q_matrix = predict_candidate_rewards(reward_models["lightgbm"], eval_df, batch_size=12_000)
q_logged = q_matrix[np.arange(len(eval_df)), eval_df["item_id"].map(action_to_index).to_numpy()]

pd.Series(
    {
        "contexts_scored": q_matrix.shape[0],
        "candidate_actions": q_matrix.shape[1],
        "mean_candidate_prediction": q_matrix.mean(),
        "mean_logged_action_prediction": q_logged.mean(),
        "max_candidate_prediction": q_matrix.max(),
    }
).to_frame("value")
value
contexts_scored 100000.000000
candidate_actions 34.000000
mean_candidate_prediction 0.004898
mean_logged_action_prediction 0.004844
max_candidate_prediction 0.837817

The summary confirms that all 100000 evaluation contexts were scored against the full 34-item action space. The mean candidate prediction sits close to the mean logged-action prediction, as expected under a near-uniform behavior policy, while the large maximum prediction flags a few contexts where the model is extremely confident.

Inspect Predicted Best Items

This cell identifies the item with the highest predicted reward for each evaluation context.

A learned contextual policy is useful only if it actually changes actions across contexts. If the same item is always best, the model has effectively learned a fixed top-item policy. If top items vary, the model is using context to personalize recommendations.

best_action_indices = q_matrix.argmax(axis=1)
best_item_ids = action_space[best_action_indices]

best_item_distribution = (
    pd.Series(best_item_ids, name="best_item_id")
    .value_counts(normalize=False)
    .rename_axis("item_id")
    .reset_index(name="contexts_as_best")
)
best_item_distribution["context_share"] = best_item_distribution["contexts_as_best"] / len(eval_df)

best_item_distribution.head(12)
item_id contexts_as_best context_share
0 27 12841 0.128410
1 31 12609 0.126090
2 23 9908 0.099080
3 19 8031 0.080310
4 13 5475 0.054750
5 14 5101 0.051010
6 21 5055 0.050550
7 22 3572 0.035720
8 12 3493 0.034930
9 10 2915 0.029150
10 15 2284 0.022840
11 32 2275 0.022750

The concentration diagnostics show whether learned contextual policies collapse onto a few items or preserve diversity. Strong concentration may raise support risk even if predicted rewards are high.
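One more quick check (a sketch built on the table above): the cumulative context share of the top-k predicted-best items shows how close the greedy policy is to a fixed top-item rule. Here the top five items cover just under half of all contexts, so the model personalizes but still leans on a handful of favorites.

# Cumulative share of contexts covered by the top-k predicted-best items.
cumulative_share = best_item_distribution["context_share"].cumsum()
print(cumulative_share.head(5).round(3).tolist())
# If a small k already covered most contexts, the "contextual" greedy policy
# would be nearly a fixed top-item policy.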

Plot Predicted Best Item Concentration

This plot shows how often each item is the predicted best action across contexts.

Some concentration is expected, but extreme concentration is a warning sign. If one item dominates all contexts, a greedy policy may have weak support and poor user experience. More diverse top-item predictions suggest more contextual personalization.

top_best_items = best_item_distribution.head(12).sort_values("context_share")

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=top_best_items, x="context_share", y="item_id", orient="h", ax=ax, color="#4C78A8")
ax.set_title("Most Common Predicted-Best Items")
ax.set_xlabel("Share of Evaluation Contexts")
ax.set_ylabel("Item ID")
ax.xaxis.set_major_formatter(lambda x, _: f"{x:.1%}")
plt.tight_layout()
plt.show()

These shares quantify how concentrated the model's predicted-best items are across contexts. The best item varies rather than one item dominating everywhere, which indicates genuine personalization, but a handful of favorites still account for a large share of contexts, which raises support risk for greedy variants.

Recreate Fixed Baseline Policies

Before adding contextual policies, we recreate two fixed baselines:

  • uniform: every item gets equal probability
  • fixed_ctr_weighted: item probability is proportional to smoothed training CTR

These baselines ground the contextual policy comparison. A learned contextual policy should match or beat these simpler policies after accounting for support and uncertainty.

SMOOTHING_ALPHA = 50
train_global_ctr = train_df["click"].mean()

item_stats = (
    train_df.groupby("item_id")
    .agg(train_impressions=("click", "size"), train_clicks=("click", "sum"), train_ctr=("click", "mean"))
    .reindex(action_space, fill_value=0)
    .rename_axis("item_id")
    .reset_index()
)
item_stats["smoothed_ctr"] = (
    item_stats["train_clicks"] + SMOOTHING_ALPHA * train_global_ctr
) / (item_stats["train_impressions"] + SMOOTHING_ALPHA)

def normalize_probabilities(values):
    values = np.asarray(values, dtype=float)
    values = np.clip(values, 0, None)
    total = values.sum()
    if total <= 0:
        raise ValueError("Policy scores must have positive total mass.")
    return values / total

uniform_probs = np.full(n_actions, 1 / n_actions)
fixed_ctr_probs = normalize_probabilities(item_stats["smoothed_ctr"].to_numpy())

fixed_policy_summary = pd.DataFrame(
    {
        "item_id": action_space,
        "uniform": uniform_probs,
        "fixed_ctr_weighted": fixed_ctr_probs,
        "smoothed_ctr": item_stats["smoothed_ctr"],
    }
)

fixed_policy_summary.sort_values("fixed_ctr_weighted", ascending=False).head(10)
item_id uniform fixed_ctr_weighted smoothed_ctr
31 31 0.029412 0.080613 0.014926
27 27 0.029412 0.061867 0.011455
23 23 0.029412 0.053500 0.009906
14 14 0.029412 0.045993 0.008516
21 21 0.029412 0.044183 0.008181
0 0 0.029412 0.040498 0.007498
22 22 0.029412 0.040379 0.007476
20 20 0.029412 0.037530 0.006949
12 12 0.029412 0.037085 0.006867
2 2 0.029412 0.036929 0.006838

The policy definitions create several offline candidates with different levels of targeting. Comparing simple policies first makes it easier to understand IPS, SNIPS, and DR behavior before moving to contextual policies.

Define Contextual Policy Helpers

This cell converts predicted reward scores into policy probability matrices.

We define several learned policies:

  • lgbm_greedy: puts all probability on the predicted best item
  • lgbm_epsilon_greedy: puts most probability on the best item and keeps exploration mass elsewhere
  • lgbm_softmax: smoothly allocates probability based on predicted reward
  • lgbm_conservative_mix: mixes the learned softmax policy with uniform exploration

The conservative mix is often the most realistic offline candidate because it benefits from model scores while preserving broad support.

def softmax_policy(scores, temperature):
    centered = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(centered / temperature)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def epsilon_greedy_policy(scores, epsilon=0.15):
    policy = np.full(scores.shape, epsilon / scores.shape[1])
    best_idx = scores.argmax(axis=1)
    policy[np.arange(scores.shape[0]), best_idx] += 1 - epsilon
    return policy


def greedy_policy(scores):
    policy = np.zeros_like(scores)
    best_idx = scores.argmax(axis=1)
    policy[np.arange(scores.shape[0]), best_idx] = 1.0
    return policy

SOFTMAX_TEMPERATURE = max(float(np.std(q_matrix)), 0.001)
SOFTMAX_TEMPERATURE
0.01352180357280611

These cells define policies that can choose different items for different user contexts. This is the transition from evaluating fixed item-distribution policies to evaluating learned contextual recommendation rules.
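The temperature heuristic deserves a quick sanity check. Softmax probability ratios scale as exp((s_a - s_b) / T), so setting T near the spread of predicted rewards keeps score gaps meaningful without collapsing the policy. A small sketch with illustrative scores, reusing softmax_policy and SOFTMAX_TEMPERATURE from above:

# How temperature shapes the softmax policy (toy scores, one context).
toy_scores = np.array([[0.010, 0.012, 0.050]])
for temperature in [0.005, SOFTMAX_TEMPERATURE, 0.1]:
    print(round(float(temperature), 4), softmax_policy(toy_scores, temperature).round(3))
# Small T is near-greedy, T near std(q_matrix) is concentrated but not
# degenerate, and large T moves toward uniform.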

Build Fixed And Contextual Policy Matrices

Each policy matrix has one row per evaluation context and one column per action. Row i, column a is pi_e(a|x_i).

Fixed policies repeat the same probability vector for every row. Contextual policies vary by row because they depend on LightGBM predicted rewards for that context.

uniform_matrix = np.tile(uniform_probs, (len(eval_df), 1))
fixed_ctr_matrix = np.tile(fixed_ctr_probs, (len(eval_df), 1))
softmax_matrix = softmax_policy(q_matrix, temperature=SOFTMAX_TEMPERATURE)
epsilon_greedy_matrix = epsilon_greedy_policy(q_matrix, epsilon=0.15)
greedy_matrix = greedy_policy(q_matrix)
conservative_mix_matrix = 0.70 * uniform_matrix + 0.30 * softmax_matrix

policy_matrices = {
    "uniform": uniform_matrix,
    "fixed_ctr_weighted": fixed_ctr_matrix,
    "lgbm_softmax": softmax_matrix,
    "lgbm_epsilon_greedy": epsilon_greedy_matrix,
    "lgbm_greedy": greedy_matrix,
    "lgbm_conservative_mix": conservative_mix_matrix,
}

pd.DataFrame(
    {
        "policy": list(policy_matrices),
        "rows": [matrix.shape[0] for matrix in policy_matrices.values()],
        "actions": [matrix.shape[1] for matrix in policy_matrices.values()],
        "min_row_sum": [matrix.sum(axis=1).min() for matrix in policy_matrices.values()],
        "max_row_sum": [matrix.sum(axis=1).max() for matrix in policy_matrices.values()],
    }
)
policy rows actions min_row_sum max_row_sum
0 uniform 100000 34 1.000000 1.000000
1 fixed_ctr_weighted 100000 34 1.000000 1.000000
2 lgbm_softmax 100000 34 1.000000 1.000000
3 lgbm_epsilon_greedy 100000 34 1.000000 1.000000
4 lgbm_greedy 100000 34 1.000000 1.000000
5 lgbm_conservative_mix 100000 34 1.000000 1.000000

The row-sum check confirms that every policy matrix is a valid conditional distribution: each of the 100000 rows sums to one across the 34 actions, for fixed and contextual policies alike.

Audit Contextual Policy Behavior

This cell summarizes how concentrated each policy is.

The most useful fields are:

  • avg_max_action_probability: how much probability the policy puts on its favorite action in an average context
  • normalized_entropy: how diffuse or concentrated the policy is
  • avg_logged_action_probability: how much probability the policy assigns to the action actually logged by the behavior policy

Aggressive policies tend to have high max probability and low entropy. Conservative policies preserve more exploration.

def normalized_entropy(policy_matrix):
    safe = np.clip(policy_matrix, 1e-15, 1)
    entropy = -(safe * np.log(safe)).sum(axis=1)
    return entropy / np.log(policy_matrix.shape[1])

logged_action_indices = eval_df["item_id"].map(action_to_index).to_numpy()
policy_audit_rows = []

for policy_name, matrix in policy_matrices.items():
    logged_probs = matrix[np.arange(len(eval_df)), logged_action_indices]
    policy_audit_rows.append(
        {
            "policy": policy_name,
            "min_row_sum": matrix.sum(axis=1).min(),
            "max_row_sum": matrix.sum(axis=1).max(),
            "avg_max_action_probability": matrix.max(axis=1).mean(),
            "p95_max_action_probability": np.percentile(matrix.max(axis=1), 95),
            "avg_normalized_entropy": normalized_entropy(matrix).mean(),
            "avg_logged_action_probability": logged_probs.mean(),
            "share_logged_action_probability_zero": (logged_probs == 0).mean(),
        }
    )

policy_audit = pd.DataFrame(policy_audit_rows).sort_values("avg_normalized_entropy")
policy_audit
policy min_row_sum max_row_sum avg_max_action_probability p95_max_action_probability avg_normalized_entropy avg_logged_action_probability share_logged_action_probability_zero
4 lgbm_greedy 1.000000 1.000000 1.000000 1.000000 0.000000 0.029100 0.970900
3 lgbm_epsilon_greedy 1.000000 1.000000 0.854412 0.854412 0.262035 0.029147 0.000000
2 lgbm_softmax 1.000000 1.000000 0.237347 0.999768 0.828504 0.029277 0.000000
1 fixed_ctr_weighted 1.000000 1.000000 0.080613 0.080613 0.965997 0.029323 0.000000
5 lgbm_conservative_mix 1.000000 1.000000 0.091792 0.320519 0.974767 0.029371 0.000000
0 uniform 1.000000 1.000000 0.029412 0.029412 1.000000 0.029412 0.000000

The contextual policy audit summarizes entropy, top-action concentration, and support properties. These diagnostics help separate a promising contextual policy from one that is too aggressive for reliable offline evaluation.

Plot Policy Concentration

This plot compares average maximum action probability with normalized entropy.

Policies in the upper-left area are aggressive: they put high probability on one action and have low entropy. Policies toward the lower-right are more exploratory. For offline OPE, very aggressive policies often have lower effective sample size.

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
    data=policy_audit,
    x="avg_normalized_entropy",
    y="avg_max_action_probability",
    hue="policy",
    s=120,
    ax=ax,
)
for _, row in policy_audit.iterrows():
    ax.text(row["avg_normalized_entropy"] + 0.005, row["avg_max_action_probability"], row["policy"], fontsize=9)
ax.set_title("Policy Concentration Diagnostics")
ax.set_xlabel("Average Normalized Entropy")
ax.set_ylabel("Average Max Action Probability")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.0%}")
plt.tight_layout()
plt.show()

This scatter summarizes how concentrated each evaluation policy is. Policies with low entropy and high max-action probability lean hard on their predicted-best item, which tends to shrink effective sample size and inflate variance for the weighting-based estimators.

Define OPE Estimator Helpers

This cell defines IPS, SNIPS, direct method, and doubly robust estimators for contextual policies.

For contextual policies, pi_e(A_i|X_i) comes from the policy matrix row for context i and the logged action column. The direct component is sum_a pi_e(a|x_i) q_hat(x_i, a).

def effective_sample_size(weights):
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


def summarize_signal(signal):
    signal = np.asarray(signal, dtype=float)
    estimate = signal.mean()
    se = signal.std(ddof=1) / np.sqrt(len(signal))
    return estimate, se, estimate - 1.96 * se, estimate + 1.96 * se


def estimate_contextual_policy(policy_name, policy_matrix, clip=None):
    reward = eval_df["click"].to_numpy(dtype=float)
    behavior_propensity = eval_df["propensity_score"].to_numpy(dtype=float)
    pi_logged = policy_matrix[np.arange(len(eval_df)), logged_action_indices]
    raw_weight = pi_logged / behavior_propensity
    weight = raw_weight if clip is None else np.minimum(raw_weight, clip)

    direct_component = (policy_matrix * q_matrix).sum(axis=1)
    correction = weight * (reward - q_logged)
    dr_signal = direct_component + correction

    ips_signal = weight * reward
    ips = summarize_signal(ips_signal)

    snips_estimate = ips_signal.sum() / weight.sum()
    snips_influence = weight * (reward - snips_estimate) / weight.mean()
    snips_se = snips_influence.std(ddof=1) / np.sqrt(len(snips_influence))
    snips = (snips_estimate, snips_se, snips_estimate - 1.96 * snips_se, snips_estimate + 1.96 * snips_se)

    dm = summarize_signal(direct_component)
    dr = summarize_signal(dr_signal)

    rows = []
    for estimator_name, values in [("IPS", ips), ("SNIPS", snips), ("DM", dm), ("DR", dr)]:
        estimate, se, lower, upper = values
        rows.append(
            {
                "policy": policy_name,
                "estimator": estimator_name,
                "clip": "none" if clip is None else clip,
                "estimate": estimate,
                "se": se,
                "ci_95_lower": lower,
                "ci_95_upper": upper,
                "mean_weight": weight.mean(),
                "max_weight": weight.max(),
                "p99_weight": np.percentile(weight, 99),
                "ess_share": effective_sample_size(weight) / len(weight),
                "mean_abs_correction": np.abs(correction).mean(),
                "mean_correction": correction.mean(),
                "avg_direct_component": direct_component.mean(),
                "observed_random_click_rate": eval_df["click"].mean(),
            }
        )
    return pd.DataFrame(rows)

The helper functions encode the estimator formulas and diagnostics used repeatedly in the notebook. Defining them once keeps the later policy comparisons consistent and easier to audit.

Estimate Fixed And Contextual Policy Values

This cell estimates each policy with IPS, SNIPS, direct method, and doubly robust OPE.

The preferred comparison for learned contextual policies is the DR estimate plus support diagnostics. Direct method can overtrust the model, while IPS/SNIPS can become noisy for aggressive policies.

policy_estimate_frames = []
for policy_name, policy_matrix in policy_matrices.items():
    policy_estimate_frames.append(estimate_contextual_policy(policy_name, policy_matrix, clip=None))

contextual_estimates = pd.concat(policy_estimate_frames, ignore_index=True)
contextual_estimates["lift_pp"] = 100 * (
    contextual_estimates["estimate"] - contextual_estimates["observed_random_click_rate"]
)
contextual_estimates["relative_lift_pct"] = 100 * (
    contextual_estimates["estimate"] / contextual_estimates["observed_random_click_rate"] - 1
)

contextual_estimates.sort_values(["estimator", "estimate"], ascending=[True, False])
policy estimator clip estimate se ci_95_lower ci_95_upper mean_weight max_weight p99_weight ess_share mean_abs_correction mean_correction avg_direct_component observed_random_click_rate lift_pp relative_lift_pct
18 lgbm_greedy DM none 0.039434 0.000194 0.039053 0.039815 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 0.039434 0.004980 3.445377 691.842735
14 lgbm_epsilon_greedy DM none 0.034253 0.000167 0.033926 0.034580 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 0.034253 0.004980 2.927343 587.819794
10 lgbm_softmax DM none 0.028713 0.000201 0.028319 0.029106 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 0.028713 0.004980 2.373265 476.559177
22 lgbm_conservative_mix DM none 0.012043 0.000068 0.011909 0.012176 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 0.012043 0.004980 0.706250 141.817278
6 fixed_ctr_weighted DM none 0.005973 0.000017 0.005940 0.006005 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.099258 19.931326
2 uniform DM none 0.004898 0.000013 0.004872 0.004924 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 -0.008185 -1.643536
19 lgbm_greedy DR none 0.007513 0.001866 0.003857 0.011170 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 0.039434 0.004980 0.253344 50.872251
15 lgbm_epsilon_greedy DR none 0.007142 0.001594 0.004016 0.010267 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 0.034253 0.004980 0.216160 43.405703
11 lgbm_softmax DR none 0.006001 0.001267 0.003518 0.008484 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 0.028713 0.004980 0.102106 20.503156
23 lgbm_conservative_mix DR none 0.005324 0.000454 0.004434 0.006215 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 0.012043 0.004980 0.034450 6.917631
7 fixed_ctr_weighted DR none 0.005281 0.000268 0.004757 0.005806 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.030119 6.048007
3 uniform DR none 0.005035 0.000226 0.004592 0.005477 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 0.005454 1.095263
16 lgbm_greedy IPS none 0.006120 0.001442 0.003293 0.008947 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 0.039434 0.004980 0.114000 22.891566
12 lgbm_epsilon_greedy IPS none 0.005949 0.001233 0.003533 0.008365 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 0.034253 0.004980 0.096900 19.457831
4 fixed_ctr_weighted IPS none 0.005172 0.000261 0.004660 0.005685 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.019244 3.864167
0 uniform IPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 0.000000 0.000000
20 lgbm_conservative_mix IPS none 0.004972 0.000274 0.004436 0.005509 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 0.012043 0.004980 -0.000752 -0.150977
8 lgbm_softmax IPS none 0.004955 0.000576 0.003826 0.006084 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 0.028713 0.004980 -0.002506 -0.503258
17 lgbm_greedy SNIPS none 0.006186 0.001453 0.003337 0.009034 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 0.039434 0.004980 0.120557 24.208173
13 lgbm_epsilon_greedy SNIPS none 0.006003 0.001240 0.003572 0.008434 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 0.034253 0.004980 0.102309 20.543932
5 fixed_ctr_weighted SNIPS none 0.005188 0.000262 0.004675 0.005701 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.020806 4.178004
1 uniform SNIPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 0.000000 0.000000
21 lgbm_conservative_mix SNIPS none 0.004979 0.000274 0.004443 0.005516 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 0.012043 0.004980 -0.000066 -0.013317
9 lgbm_softmax SNIPS none 0.004978 0.000577 0.003846 0.006109 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 0.028713 0.004980 -0.000222 -0.044532

The contextual OPE estimates compare learned policies against fixed baselines. The key question is whether personalization improves estimated value without creating unacceptable weight or support risk.

Plot OPE Estimates For Learned Policies

This plot compares IPS, SNIPS, and DR estimates. Direct method is omitted to keep the visual focused on estimators that incorporate logged outcomes through propensities or residual correction.

The key pattern to look for is whether contextual policies improve estimated value while keeping uncertainty and support acceptable.

plot_estimates = contextual_estimates.query("estimator in ['IPS', 'SNIPS', 'DR']").copy()
plot_estimates["lower_error"] = plot_estimates["estimate"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["estimate"]

policy_order = plot_estimates["policy"].drop_duplicates().tolist()
estimator_order = ["IPS", "SNIPS", "DR"]
offsets = {"IPS": -0.22, "SNIPS": 0.0, "DR": 0.22}
colors = {"IPS": "#F58518", "SNIPS": "#54A24B", "DR": "#B279A2"}

fig, ax = plt.subplots(figsize=(13, 5))
for estimator in estimator_order:
    subset = plot_estimates[plot_estimates["estimator"] == estimator]
    for _, row in subset.iterrows():
        x_base = policy_order.index(row["policy"])
        x = x_base + offsets[estimator]
        ax.errorbar(
            x=x,
            y=row["estimate"],
            yerr=[[row["lower_error"]], [row["upper_error"]]],
            fmt="o",
            color=colors[estimator],
            ecolor=colors[estimator],
            capsize=4,
            linewidth=1.4,
            markersize=6,
            label=estimator if row["policy"] == subset["policy"].iloc[0] else None,
        )

ax.axhline(eval_df["click"].mean(), color="black", linestyle="--", linewidth=1, label="Observed random")
ax.set_xticks(range(len(policy_order)))
ax.set_xticklabels(policy_order, rotation=25, ha="right")
ax.set_title("OPE Estimates For Fixed And Contextual Policies")
ax.set_xlabel("Evaluation Policy")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
ax.legend(title="Estimator")
plt.tight_layout()
plt.show()

The estimate plot compares policies on the same offline value scale. Error bars and estimator differences are just as important as the ranking, because high-variance estimates should not drive product decisions alone.

Weight And Support Diagnostics

This table focuses on the importance weights for each policy.

The most aggressive policies may have high estimated value but low effective sample size. For a realistic offline recommendation, we should prefer policies with both good value and defensible support.

weight_diagnostics = (
    contextual_estimates.query("estimator == 'DR'")
    [["policy", "mean_weight", "max_weight", "p99_weight", "ess_share", "mean_abs_correction", "mean_correction"]]
    .merge(policy_audit, on="policy", how="left")
    .sort_values("ess_share")
)

weight_diagnostics
policy mean_weight max_weight p99_weight ess_share mean_abs_correction mean_correction min_row_sum max_row_sum avg_max_action_probability p95_max_action_probability avg_normalized_entropy avg_logged_action_probability share_logged_action_probability_zero
4 lgbm_greedy 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 1.000000 1.000000 1.000000 1.000000 0.000000 0.029100 0.970900
3 lgbm_epsilon_greedy 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 1.000000 1.000000 0.854412 0.854412 0.262035 0.029147 0.000000
2 lgbm_softmax 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 1.000000 1.000000 0.237347 0.999768 0.828504 0.029277 0.000000
5 lgbm_conservative_mix 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 1.000000 1.000000 0.091792 0.320519 0.974767 0.029371 0.000000
1 fixed_ctr_weighted 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 1.000000 1.000000 0.080613 0.080613 0.965997 0.029323 0.000000
0 uniform 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 1.000000 1.000000 0.029412 0.029412 1.000000 0.029412 0.000000

Read this table as an aggressiveness gradient. lgbm_greedy puts zero probability on about 97% of logged actions and keeps an effective sample of roughly 3% of rows, while the conservative mix and the fixed baselines retain most of the log. Low ESS here foreshadows the wide intervals and clipping sensitivity in the diagnostics that follow.

Plot Weight Risk And Effective Sample Size

This plot separates two ideas:

  • the largest and tail weights, which flag variance risk
  • effective sample size, which summarizes how much useful support remains after weighting

This makes the tradeoff clear: aggressive contextual policies may improve predicted value, but they often pay for it with weaker support.

weight_plot_rows = []
for policy_name, policy_matrix in policy_matrices.items():
    pi_logged = policy_matrix[np.arange(len(eval_df)), logged_action_indices]
    weights = pi_logged / eval_df["propensity_score"].to_numpy()
    weight_plot_rows.append(
        {
            "policy": policy_name,
            "p50": np.percentile(weights, 50),
            "p90": np.percentile(weights, 90),
            "p99": np.percentile(weights, 99),
            "max": weights.max(),
            "ess_share": effective_sample_size(weights) / len(weights),
        }
    )

weight_plot_summary = pd.DataFrame(weight_plot_rows).sort_values("ess_share", ascending=False)
weight_percentile_plot = weight_plot_summary.melt(
    id_vars="policy",
    value_vars=["p50", "p90", "p99", "max"],
    var_name="statistic",
    value_name="weight",
)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), gridspec_kw={"width_ratios": [1.8, 1]})

sns.scatterplot(
    data=weight_percentile_plot,
    x="weight",
    y="policy",
    hue="statistic",
    style="statistic",
    s=90,
    ax=axes[0],
)
axes[0].set_xscale("log")
axes[0].set_title("Importance Weight Percentiles")
axes[0].set_xlabel("Weight, log scale")
axes[0].set_ylabel("Policy")
axes[0].axvline(1, color="black", linestyle="--", linewidth=1, alpha=0.6)
axes[0].legend(title="Statistic", bbox_to_anchor=(1.02, 1), loc="upper left")

sns.barplot(data=weight_plot_summary, x="ess_share", y="policy", color="#4C78A8", ax=axes[1])
axes[1].set_title("Effective Sample Size Share")
axes[1].set_xlabel("ESS / rows")
axes[1].set_ylabel("")
axes[1].set_xlim(0, 1.05)
axes[1].xaxis.set_major_formatter(lambda x, _: f"{x:.0%}")
for container in axes[1].containers:
    axes[1].bar_label(container, fmt=lambda x: f"{x:.0%}", padding=3)

plt.tight_layout()
plt.show()

weight_plot_summary

policy p50 p90 p99 max ess_share
0 uniform 1.000000 1.000000 1.000000 1.000000 1.000000
1 fixed_ctr_weighted 0.852995 1.563756 2.740842 2.740842 0.790628
5 lgbm_conservative_mix 0.966942 1.040916 2.234187 10.900000 0.694032
2 lgbm_softmax 0.889807 1.136388 5.113957 34.000000 0.168632
3 lgbm_epsilon_greedy 0.150000 0.150000 29.050000 29.050000 0.039955
4 lgbm_greedy 0.000000 0.000000 34.000000 34.000000 0.029100

Effective sample size turns weight concentration into an intuitive sample-size diagnostic. A low ESS means the estimator has less usable information than the raw row count suggests.
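
For reference, here is a minimal sketch of the Kish effective-sample-size formula. The effective_sample_size helper used above was defined earlier in the notebook and is assumed, not confirmed, to implement this standard definition.

import numpy as np

def effective_sample_size_sketch(weights):
    """Kish ESS: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Uniform weights keep the full sample; one dominant weight collapses it.
print(effective_sample_size_sketch([1.0, 1.0, 1.0, 1.0]))   # 4.0
print(effective_sample_size_sketch([34.0, 0.1, 0.1, 0.1]))  # ~1.02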

Weight Clipping Sensitivity For Contextual Policies

This section checks whether learned policy estimates depend on extreme weights.

We compare no clipping with caps at 5, 10, and 20. A policy whose DR estimate changes sharply under clipping is less stable. A policy whose DR estimate stays similar is more credible for offline recommendation.
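
Before running the sweep, here is a minimal sketch of what the cap does to raw importance weights; estimate_contextual_policy, defined earlier in the notebook, is assumed to apply the cap in this spirit before forming the IPS, SNIPS, and DR terms.

import numpy as np

# Hypothetical weights spanning the range seen in the diagnostics above.
# Clipping truncates only the right tail; zero weights stay zero.
raw_weights = np.array([0.0, 0.15, 1.0, 29.05, 34.0])
clip = 5
clipped = raw_weights if clip is None else np.minimum(raw_weights, clip)
print(clipped)  # [0.   0.15 1.   5.   5.  ]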

clip_values = [None, 5, 10, 20]
clipping_frames = []

for clip in clip_values:
    for policy_name, policy_matrix in policy_matrices.items():
        clipping_frames.append(estimate_contextual_policy(policy_name, policy_matrix, clip=clip))

clipping_sensitivity = pd.concat(clipping_frames, ignore_index=True)
clipping_sensitivity["lift_pp"] = 100 * (
    clipping_sensitivity["estimate"] - clipping_sensitivity["observed_random_click_rate"]
)

clipping_sensitivity.query("estimator == 'DR'").head(12)
policy estimator clip estimate se ci_95_lower ci_95_upper mean_weight max_weight p99_weight ess_share mean_abs_correction mean_correction avg_direct_component observed_random_click_rate lift_pp
3 uniform DR none 0.005035 0.000226 0.004592 0.005477 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 0.005454
7 fixed_ctr_weighted DR none 0.005281 0.000268 0.004757 0.005806 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.030119
11 lgbm_softmax DR none 0.006001 0.001267 0.003518 0.008484 0.995411 34.000000 5.113957 0.168632 0.032162 -0.022712 0.028713 0.004980 0.102106
15 lgbm_epsilon_greedy DR none 0.007142 0.001594 0.004016 0.010267 0.990990 29.050000 29.050000 0.039955 0.038508 -0.027112 0.034253 0.004980 0.216160
19 lgbm_greedy DR none 0.007513 0.001866 0.003857 0.011170 0.989400 34.000000 34.000000 0.029100 0.043579 -0.031920 0.039434 0.004980 0.253344
23 lgbm_conservative_mix DR none 0.005324 0.000454 0.004434 0.006215 0.998623 10.900000 2.234187 0.694032 0.016486 -0.006718 0.012043 0.004980 0.034450
27 uniform DR 5 0.005035 0.000226 0.004592 0.005477 1.000000 1.000000 1.000000 1.000000 0.009768 0.000136 0.004898 0.004980 0.005454
31 fixed_ctr_weighted DR 5 0.005281 0.000268 0.004757 0.005806 0.996987 2.740842 2.740842 0.790628 0.010967 -0.000691 0.005973 0.004980 0.030119
35 lgbm_softmax DR 5 0.024611 0.000343 0.023938 0.025284 0.851704 5.000000 5.000000 0.650776 0.012142 -0.004102 0.028713 0.004980 1.963055
39 lgbm_epsilon_greedy DR 5 0.029721 0.000311 0.029112 0.030329 0.291135 5.000000 5.000000 0.113112 0.007682 -0.004533 0.034253 0.004980 2.474054
43 lgbm_greedy DR 5 0.034740 0.000323 0.034107 0.035372 0.145500 5.000000 5.000000 0.029100 0.006409 -0.004694 0.039434 0.004980 2.975960
47 lgbm_conservative_mix DR 5 0.008822 0.000295 0.008243 0.009400 0.975307 5.000000 2.234187 0.876047 0.012794 -0.003221 0.012043 0.004980 0.384163

The clipping sensitivity check shows how estimates change when extreme weights are capped. Stable estimates across clipping thresholds are more reassuring than estimates that depend strongly on a few high-weight rows.

Plot DR Clipping Sensitivity

This plot shows how DR estimates move as weights are clipped.

Greedy and epsilon-greedy policies are expected to be more sensitive because they concentrate probability on fewer actions. The conservative mix should be more stable because it keeps substantial uniform exploration mass.
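
For orientation, the weight tables above are consistent with an epsilon of 0.15 over the 34 candidate items: non-best actions receive probability 0.15/34, which against the uniform 1/34 behavior policy gives exactly the 0.15 median weight, and the best action gets 0.85 + 0.15/34 ≈ 0.8544, matching the audit table. A minimal sketch of that construction, assuming these values:

import numpy as np

def epsilon_greedy_matrix(scores, epsilon=0.15):
    """Spread epsilon uniformly over all actions, then add the
    remaining mass to each row's argmax action."""
    n_rows, n_actions = scores.shape
    policy = np.full((n_rows, n_actions), epsilon / n_actions)
    policy[np.arange(n_rows), scores.argmax(axis=1)] += 1 - epsilon
    return policy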

dr_clipping_plot = clipping_sensitivity.query("estimator == 'DR'").copy()
dr_clipping_plot["clip_label"] = dr_clipping_plot["clip"].astype(str)

fig, ax = plt.subplots(figsize=(11, 5))
sns.lineplot(
    data=dr_clipping_plot,
    x="clip_label",
    y="estimate",
    hue="policy",
    marker="o",
    ax=ax,
)
ax.axhline(eval_df["click"].mean(), color="black", linestyle="--", linewidth=1, label="Observed random")
ax.set_title("DR Estimate Sensitivity To Weight Clipping")
ax.set_xlabel("Weight Clip")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
plt.tight_layout()
plt.show()

In the plot, the greedy, epsilon-greedy, and softmax curves jump sharply once weights are capped at 5, while the conservative mix and the fixed baselines barely move. The jump is a warning sign: capping removes the large weights that let the residual correction offset the optimistic direct-model component, so the clipped DR estimates drift toward pure model predictions.

Clipping Stability Summary

This cell summarizes how much each policy’s DR estimate changes across clipping thresholds.

A smaller estimate range means the policy estimate is less dependent on extreme weights. This is useful when deciding whether an aggressive learned policy is too risky for offline recommendation.

clipping_stability = (
    clipping_sensitivity.query("estimator == 'DR'")
    .groupby("policy")
    .agg(
        min_dr_estimate=("estimate", "min"),
        max_dr_estimate=("estimate", "max"),
        dr_estimate_range=("estimate", lambda x: x.max() - x.min()),
        min_ess_share=("ess_share", "min"),
        max_weight=("max_weight", "max"),
    )
    .reset_index()
    .sort_values("dr_estimate_range", ascending=False)
)

clipping_stability
policy min_dr_estimate max_dr_estimate dr_estimate_range min_ess_share max_weight
3 lgbm_greedy 0.007513 0.034740 0.027226 0.029100 34.000000
2 lgbm_epsilon_greedy 0.007142 0.029721 0.022579 0.039955 29.050000
4 lgbm_softmax 0.006001 0.024611 0.018609 0.168632 34.000000
1 lgbm_conservative_mix 0.005324 0.008822 0.003497 0.694032 10.900000
0 fixed_ctr_weighted 0.005281 0.005281 0.000000 0.790628 2.740842
5 uniform 0.005035 0.005035 0.000000 1.000000 1.000000

The range column makes the same point numerically: the lgbm_greedy DR estimate moves by about 2.7 percentage points across clipping settings and lgbm_epsilon_greedy by about 2.3, while the conservative mix moves by roughly 0.35 points and the fixed policies do not move at all.

Conservative Policy Improvement Table

This table creates a support-adjusted score for policy selection.

The score is:

support_adjusted_lift_pp = DR lift in percentage points * sqrt(ESS share)

This is not a formal causal estimator. It is a decision aid that penalizes policies with weak effective sample size. It helps prevent the notebook from recommending a policy only because it has an aggressive point estimate.
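
As a quick worked check against the decision table below, the lgbm_greedy row combines a 0.2533-point lift with an ESS share of 0.0291:

print(0.253344 * 0.029100 ** 0.5)  # ~0.043217, the top support_adjusted_lift_pp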

dr_results = contextual_estimates.query("estimator == 'DR'").copy()
dr_results = dr_results.merge(
    clipping_stability[["policy", "dr_estimate_range"]], on="policy", how="left"
)
dr_results["support_adjusted_lift_pp"] = dr_results["lift_pp"] * np.sqrt(dr_results["ess_share"])
dr_results["risk_flag"] = np.select(
    [
        dr_results["ess_share"] < 0.10,
        dr_results["dr_estimate_range"] > 0.001,
        dr_results["max_weight"] > 20,
    ],
    ["low ESS", "clip sensitive", "large max weight"],
    default="reasonable support",
)

policy_decision_table = dr_results[
    [
        "policy",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "ess_share",
        "max_weight",
        "p99_weight",
        "dr_estimate_range",
        "support_adjusted_lift_pp",
        "risk_flag",
    ]
].sort_values("support_adjusted_lift_pp", ascending=False)

policy_decision_table
policy estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct ess_share max_weight p99_weight dr_estimate_range support_adjusted_lift_pp risk_flag
4 lgbm_greedy 0.007513 0.003857 0.011170 0.253344 50.872251 0.029100 34.000000 34.000000 0.027226 0.043217 low ESS
3 lgbm_epsilon_greedy 0.007142 0.004016 0.010267 0.216160 43.405703 0.039955 29.050000 29.050000 0.022579 0.043208 low ESS
2 lgbm_softmax 0.006001 0.003518 0.008484 0.102106 20.503156 0.168632 34.000000 5.113957 0.018609 0.041930 clip sensitive
5 lgbm_conservative_mix 0.005324 0.004434 0.006215 0.034450 6.917631 0.694032 10.900000 2.234187 0.003497 0.028700 clip sensitive
1 fixed_ctr_weighted 0.005281 0.004757 0.005806 0.030119 6.048007 0.790628 2.740842 2.740842 0.000000 0.026781 reasonable support
0 uniform 0.005035 0.004592 0.005477 0.005454 1.095263 1.000000 1.000000 1.000000 0.000000 0.005454 reasonable support

The conservative improvement table uses support and stability diagnostics to discount aggressive policies. This produces a more deployment-minded ranking than simply sorting by the largest estimated value.

Plain-English Interpretation

This cell writes a short explanation of the advanced policy-learning result.

The wording is intentionally careful. Offline OPE can identify a good candidate for online testing, but it cannot replace an A/B test. The recommendation should be framed as a next experiment, not a production guarantee.

interpretation = f"""
Advanced policy-learning result

The strongest offline candidate by support-adjusted DR lift is: {recommended_contextual_policy['policy']}.

Why this policy is a reasonable A/B-test candidate:
- It uses context-aware LightGBM reward predictions instead of fixed item probabilities.
- It is evaluated with doubly robust OPE, not only direct model predictions.
- Its support diagnostics are included in the recommendation, so the decision is not based on point estimate alone.
- It is compared against fixed baselines and more aggressive learned policies.

What we should not claim:
- This offline result does not prove production lift.
- It does not measure long-term satisfaction, retention, creator/item ecosystem effects, or novelty fatigue.
- Aggressive learned policies may still be risky if support is weak, even when predicted reward looks high.

Recommended next step:
Take the selected contextual policy, or an even more conservative mix of it with the random policy, into an online A/B test with guardrail metrics.
""".strip()

print(interpretation)
Advanced policy-learning result

The strongest offline candidate by support-adjusted DR lift is: lgbm_greedy.

Why this policy is a reasonable A/B-test candidate:
- It uses context-aware LightGBM reward predictions instead of fixed item probabilities.
- It is evaluated with doubly robust OPE, not only direct model predictions.
- Its support diagnostics are included in the recommendation, so the decision is not based on point estimate alone.
- It is compared against fixed baselines and more aggressive learned policies.

What we should not claim:
- This offline result does not prove production lift.
- It does not measure long-term satisfaction, retention, creator/item ecosystem effects, or novelty fatigue.
- Aggressive learned policies may still be risky if support is weak, even when predicted reward looks high.

Recommended next step:
Take the selected contextual policy, or an even more conservative mix of it with the random policy, into an online A/B test with guardrail metrics.

This text cell is the guardrail against overclaiming. Offline evaluation can recommend candidates for online testing, but it cannot replace an A/B test when business-critical deployment decisions are at stake.

Save Contextual Policy Outputs

This final code cell saves the important advanced-policy tables to the off-policy evaluation writeup folder.

These outputs can be reused in a final report notebook without rerunning the contextual policy scoring step.

WRITEUP_DIR = PROJECT_ROOT / "notebooks/projects/project_2_off_policy_evaluation/writeup"
TABLE_DIR = WRITEUP_DIR / "tables"
TABLE_DIR.mkdir(parents=True, exist_ok=True)

contextual_estimates.to_csv(TABLE_DIR / "contextual_policy_estimates.csv", index=False)
policy_audit.to_csv(TABLE_DIR / "contextual_policy_audit.csv", index=False)
weight_diagnostics.to_csv(TABLE_DIR / "contextual_weight_diagnostics.csv", index=False)
clipping_stability.to_csv(TABLE_DIR / "contextual_clipping_stability.csv", index=False)
policy_decision_table.to_csv(TABLE_DIR / "contextual_policy_decision_table.csv", index=False)

artifact_table = pd.DataFrame(
    {
        "path": [str(path.relative_to(PROJECT_ROOT)) for path in sorted(TABLE_DIR.glob("contextual_*.csv"))],
        "size_kb": [path.stat().st_size / 1024 for path in sorted(TABLE_DIR.glob("contextual_*.csv"))],
    }
)

artifact_table
path size_kb
0 notebooks/projects/project_2_off_policy_evaluation/writeup/tables... 0.702148
1 notebooks/projects/project_2_off_policy_evaluation/writeup/tables... 0.830078
2 notebooks/projects/project_2_off_policy_evaluation/writeup/tables... 1.428711
3 notebooks/projects/project_2_off_policy_evaluation/writeup/tables... 6.987305
4 notebooks/projects/project_2_off_policy_evaluation/writeup/tables... 1.569336

This cell saves reusable outputs for downstream notebooks or the final writeup. Persisting these artifacts makes the project modular and prevents later notebooks from repeating expensive or fragile setup work.

Notebook 6 Takeaways

This notebook moved the work from evaluating fixed item-probability policies to learning context-aware policies.

The key lessons are:

  • Contextual reward models can score every candidate item for every context.
  • Greedy learned policies may look attractive but often create support risk.
  • Conservative contextual policies can balance personalization with stable OPE diagnostics.
  • Doubly robust OPE is the right estimator family for these learned policies because it combines model predictions with logged residual corrections (see the sketch after this list).
  • The best offline recommendation is the policy that pairs good estimated value with stable support diagnostics, not the most aggressive point estimate.
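
For reference, a minimal sketch of the textbook doubly robust value term; the estimate_contextual_policy helper defined earlier in the notebook is assumed to compute something equivalent.

import numpy as np

def dr_value_sketch(q_hat_under_pi, q_hat_logged, rewards, weights):
    """Model prediction under the evaluation policy plus an
    importance-weighted residual correction on logged actions."""
    correction = np.asarray(weights) * (np.asarray(rewards) - np.asarray(q_hat_logged))
    return float(np.mean(np.asarray(q_hat_under_pi) + correction))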

At this point, the off-policy evaluation work has a strong technical arc. The next notebook should be a concise final report notebook that packages figures, tables, limitations, and resume-ready language.