04 Doubly Robust Off-Policy Evaluation

This notebook upgrades the analysis from pure importance weighting to doubly robust off-policy evaluation.

Notebook 3 estimated policy values with IPS and SNIPS. Those estimators are valuable because they use the known logging propensities directly, but they can become noisy when importance weights are large or concentrated. Doubly robust OPE adds a second ingredient: a reward model.

The reward model estimates:

q(x, a) = E[Y | X = x, A = a]

For this project, Y is click, X is recommendation context, and A is the recommended item. The doubly robust estimator combines two ingredients: the reward model's predicted value under the evaluation policy, and an importance-weighted correction built from the model's residuals on the logged actions.

This notebook keeps the same evaluation policies from Notebook 3 so that the comparison is clear: IPS, SNIPS, direct method, and doubly robust estimates are all answering the same policy-value question.

Why Doubly Robust OPE?

IPS estimates policy value by weighting observed rewards:

mean((pi_e(A|X) / pi_b(A|X)) * Y)

This is powerful because it does not require a reward model, but it can be high variance. If the behavior policy rarely chose an action that the evaluation policy likes, the importance weight can become large. For example, the epsilon-greedy policy defined later puts probability 0.85 on a single item that the uniform logger chose about 3% of the time, so that item's logged rows carry importance weights near 29.

The direct method takes the opposite approach. It trains a reward model q_hat(x, a) and estimates policy value by averaging predicted rewards under the evaluation policy:

mean(sum_a pi_e(a|x) * q_hat(x, a))

The direct method can be stable, but it is biased if the reward model is wrong.

The doubly robust estimator combines both:

mean(sum_a pi_e(a|x) * q_hat(x, a) + (pi_e(A|X) / pi_b(A|X)) * (Y - q_hat(X, A)))

The first term is the model-based policy value. The second term corrects the model using observed residuals on logged actions. It is called doubly robust because, under standard assumptions, it can remain consistent if either the propensities are correct or the reward model is correct.
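To make the mechanics concrete before touching real data, here is a minimal sketch of all three estimators on toy numbers. Every array below is invented for illustration; none of these values come from the log.

import numpy as np

# Toy values for illustration only (not from the log): three logged rows.
pi_e = np.array([0.8, 0.8, 0.2])        # pi_e(A|X) at the logged action
pi_b = np.array([0.5, 0.5, 0.5])        # pi_b(A|X), the logging propensity
y = np.array([1.0, 0.0, 0.0])           # observed rewards
q_logged = np.array([0.6, 0.5, 0.1])    # q_hat(X, A) at the logged action
dm_part = np.array([0.50, 0.45, 0.40])  # sum_a pi_e(a|x) * q_hat(x, a)

w = pi_e / pi_b                               # importance weights
ips = np.mean(w * y)                          # model-free, can be high variance
dm = np.mean(dm_part)                         # model-based, biased if q_hat is off
dr = np.mean(dm_part + w * (y - q_logged))    # model value + weighted residual
print(f"IPS={ips:.3f}  DM={dm:.3f}  DR={dr:.3f}")

If q_hat were exactly right, the residual term would average to zero and DR would collapse to the direct method; if the propensities are exactly right, the residual term repairs whatever bias q_hat introduces.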

Notebook Setup

This cell imports data, modeling, evaluation, and plotting libraries. It also suppresses one known LightGBM/sklearn feature-name metadata warning so the notebook output stays readable while real warnings remain visible.

We train two reward models:

  • logistic regression as a transparent baseline
  • LightGBM as a stronger nonlinear model

Using both models is useful because doubly robust OPE should not be treated as a black box. If DR estimates move sharply across reward models, that is a sensitivity warning.

from pathlib import Path
import warnings

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, log_loss, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names, but LGBMClassifier was fitted with feature names",
    category=UserWarning,
)

This cell prepares the notebook environment for doubly robust off-policy evaluation. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are ready for the OPE diagnostics that follow.

Locate The Cached Open Bandit Sample

This cell finds the repository root and loads the random-policy sample created in Notebook 1.

We use the random-policy log for the main DR estimates because Notebook 2 showed that it has broad support and stable propensities. This makes it the cleanest source for introducing a reward model and doubly robust correction.

RANDOM_SAMPLE_RELATIVE_PATH = Path("data/processed/open_bandit_random_men_sample.parquet")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / RANDOM_SAMPLE_RELATIVE_PATH).exists()
)

RANDOM_SAMPLE_PATH = PROJECT_ROOT / RANDOM_SAMPLE_RELATIVE_PATH
random_df = pd.read_parquet(RANDOM_SAMPLE_PATH).sort_values("timestamp").reset_index(drop=True)

pd.Series(
    {
        "project_root": PROJECT_ROOT,
        "random_sample_path": RANDOM_SAMPLE_PATH,
        "rows": len(random_df),
        "columns": random_df.shape[1],
        "observed_click_rate": random_df["click"].mean(),
    }
).to_frame("value")
value
project_root /home/apex/Documents/ranking_sys
random_sample_path /home/apex/Documents/ranking_sys/data/processe...
rows 200000
columns 50
observed_click_rate 0.005190

The printed paths are a reproducibility checkpoint. Once the notebook can locate the cached sample from the repository root, the rest of the analysis can run without manual path edits.

Recreate The Train And Evaluation Split

To stay aligned with Notebook 3, this notebook uses the same time-based split:

  • the first half of rows defines policies and trains reward models
  • the second half evaluates policy values

This separation matters. We should not tune a policy or reward model on the exact same rows used for final OPE without acknowledging the risk of overfitting.

SPLIT_FRACTION = 0.50
split_idx = int(len(random_df) * SPLIT_FRACTION)

train_df = random_df.iloc[:split_idx].copy()
eval_df = random_df.iloc[split_idx:].copy()

split_summary = pd.DataFrame(
    {
        "split": ["train", "evaluation"],
        "rows": [len(train_df), len(eval_df)],
        "min_timestamp": [train_df["timestamp"].min(), eval_df["timestamp"].min()],
        "max_timestamp": [train_df["timestamp"].max(), eval_df["timestamp"].max()],
        "click_rate": [train_df["click"].mean(), eval_df["click"].mean()],
    }
)

split_summary
split rows min_timestamp max_timestamp click_rate
0 train 100000 2019-11-24 00:00:03.800821+00:00 2019-11-25 10:01:18.392921+00:00 0.005400
1 evaluation 100000 2019-11-25 10:01:18.393450+00:00 2019-11-27 02:50:16.027289+00:00 0.004980

The split separates policy construction from policy evaluation. This prevents using the same rows to design a policy and evaluate it, which would make the offline result too optimistic.

Define The Action Space And Feature Groups

The action is the recommended item. In the men campaign there are 34 item actions.

For reward modeling, we build features that are available before the click outcome:

  • user categorical features
  • recommendation position and hour
  • candidate item ID
  • candidate item metadata
  • selected user-item affinity for the candidate item

The selected affinity is important. The raw log has one affinity column per item. When we score a candidate action, we use the affinity column corresponding to that candidate item.

action_space = np.array(sorted(random_df["item_id"].unique()))
n_actions = len(action_space)

user_feature_cols = [col for col in random_df.columns if col.startswith("user_feature_")]
affinity_cols_by_action = [f"user-item_affinity_{item_id}" for item_id in action_space]
item_feature_cols = [col for col in random_df.columns if col.startswith("item_feature_")]

categorical_features = [
    "position",
    "hour",
    "item_id",
    *user_feature_cols,
    "item_feature_1",
    "item_feature_2",
    "item_feature_3",
]
numeric_features = ["selected_affinity", "item_feature_0"]
feature_cols = categorical_features + numeric_features

feature_group_summary = pd.DataFrame(
    {
        "group": ["actions", "user categorical", "item metadata", "model categorical", "model numeric"],
        "count": [n_actions, len(user_feature_cols), len(item_feature_cols), len(categorical_features), len(numeric_features)],
        "columns": [list(action_space), user_feature_cols, item_feature_cols, categorical_features, numeric_features],
    }
)

feature_group_summary
group count columns
0 actions 34 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1 user categorical 4 [user_feature_0, user_feature_1, user_feature_...
2 item metadata 4 [item_feature_0, item_feature_1, item_feature_...
3 model categorical 10 [position, hour, item_id, user_feature_0, user...
4 model numeric 2 [selected_affinity, item_feature_0]

The action-space definition fixes the set of items that evaluation policies are allowed to choose. OPE estimates are only meaningful for policies whose action probabilities live inside this logged support.

Build Item Metadata Lookup

The cached random sample already contains item metadata for the logged item. To score counterfactual candidate actions, we need one metadata row per item.

This cell builds an item lookup table. Later, when we score item a for a context, we attach the metadata for item a, not the metadata for the item that happened to be logged in that row.

item_context = (
    random_df[["item_id", *item_feature_cols]]
    .drop_duplicates("item_id")
    .set_index("item_id")
    .sort_index()
)

missing_items = sorted(set(action_space) - set(item_context.index))
if missing_items:
    raise ValueError(f"Missing item metadata for item IDs: {missing_items}")

item_context.head()
item_feature_0 item_feature_1 item_feature_2 item_feature_3
item_id
0 -0.677183 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
1 -0.720300 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
2 0.745662 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
3 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d
4 1.651109 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759

These feature-building cells define the context used by reward models. Reward models need both user context and candidate item context so they can predict counterfactual rewards for actions that were not logged.

Create Logged-Action Reward Model Features

This helper creates a supervised-learning frame for rows where the logged action is known.

For each row, selected_affinity is taken from the affinity column that corresponds to the logged item_id. This makes the training data match the counterfactual scoring data we will create later: every feature row represents a specific context-action pair.

def make_logged_feature_frame(df):
    """Create reward-model features for the action that was actually logged."""
    frame = df.copy()
    affinity_matrix = frame[affinity_cols_by_action].to_numpy()
    item_positions = frame["item_id"].map({item_id: idx for idx, item_id in enumerate(action_space)}).to_numpy()
    frame["selected_affinity"] = affinity_matrix[np.arange(len(frame)), item_positions]
    return frame[feature_cols]

X_train = make_logged_feature_frame(train_df)
y_train = train_df["click"].astype(int)
X_eval_logged = make_logged_feature_frame(eval_df)
y_eval = eval_df["click"].astype(int)

X_train.head()
position hour item_id user_feature_0 user_feature_1 user_feature_2 user_feature_3 item_feature_1 item_feature_2 item_feature_3 selected_affinity item_feature_0
0 1 0 0 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
1 3 0 25 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
2 2 0 23 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
3 1 0 25 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
4 2 0 30 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324

The preview confirms that each training row pairs a context with the logged item's features, and that selected_affinity comes from the affinity column matching that item, so training rows and the later counterfactual rows share one schema.

Check Reward Model Feature Quality

This cell checks missingness and target balance for the reward-model training frame.

Reward modeling is now part of the causal estimator. If feature construction is broken, the direct method and doubly robust estimates will be misleading. This quick check catches obvious issues before model fitting.

feature_quality = pd.DataFrame(
    {
        "feature": feature_cols,
        "missing_rate_train": [X_train[col].isna().mean() for col in feature_cols],
        "missing_rate_eval": [X_eval_logged[col].isna().mean() for col in feature_cols],
        "train_unique_values": [X_train[col].nunique(dropna=False) for col in feature_cols],
    }
)

target_summary = pd.Series(
    {
        "train_rows": len(y_train),
        "eval_rows": len(y_eval),
        "train_click_rate": y_train.mean(),
        "eval_click_rate": y_eval.mean(),
    }
).to_frame("value")

feature_quality, target_summary
(              feature  missing_rate_train  missing_rate_eval  \
 0            position            0.000000           0.000000   
 1                hour            0.000000           0.000000   
 2             item_id            0.000000           0.000000   
 3      user_feature_0            0.000000           0.000000   
 4      user_feature_1            0.000000           0.000000   
 5      user_feature_2            0.000000           0.000000   
 6      user_feature_3            0.000000           0.000000   
 7      item_feature_1            0.000000           0.000000   
 8      item_feature_2            0.000000           0.000000   
 9      item_feature_3            0.000000           0.000000   
 10  selected_affinity            0.000000           0.000000   
 11     item_feature_0            0.000000           0.000000   
 
     train_unique_values  
 0                     3  
 1                    24  
 2                    34  
 3                     4  
 4                     6  
 5                    10  
 6                    10  
 7                     7  
 8                    16  
 9                     4  
 10                    4  
 11                   25  ,
                          value
 train_rows       100000.000000
 eval_rows        100000.000000
 train_click_rate      0.005400
 eval_click_rate       0.004980)

The feature-quality table checks whether the reward-model inputs are present and informative. If important context fields were missing or constant, direct method and DR estimates would lean on weak predictions.

Define The Preprocessor

This cell defines the preprocessing shared by the reward models.

Categorical features are one-hot encoded. Numeric features are scaled. The final transformed matrix is then passed to logistic regression or LightGBM.

The preprocessing is intentionally standard and auditable. For OPE, a slightly weaker but understandable reward model is often better for a portfolio notebook than a highly tuned model whose behavior is hard to explain.

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("numeric", StandardScaler(), numeric_features),
    ],
    remainder="drop",
)

Sharing one preprocessor across both reward models keeps the comparison fair: any difference between their direct method and DR estimates reflects the model class, not differences in feature handling.

Train Reward Models

This cell trains two reward models.

The logistic regression model is a baseline with a simple functional form. The LightGBM model can capture nonlinear interactions between context, item metadata, and affinity. We do not use class weighting because the model probabilities should estimate click probability, not rebalance the classification objective for a different decision threshold.

reward_models = {
    "logistic": Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("model", LogisticRegression(max_iter=500, solver="lbfgs")),
        ]
    ),
    "lightgbm": Pipeline(
        steps=[
            ("preprocess", preprocessor),
            (
                "model",
                lgb.LGBMClassifier(
                    n_estimators=200,
                    learning_rate=0.05,
                    num_leaves=31,
                    min_child_samples=100,
                    subsample=0.85,
                    colsample_bytree=0.85,
                    random_state=42,
                    verbose=-1,
                ),
            ),
        ]
    ),
}

for model_name, model in reward_models.items():
    model.fit(X_train, y_train)

list(reward_models)
['logistic', 'lightgbm']

The reward models learn expected click probability from logged action-context pairs. These predictions power the direct method and the model-based part of the doubly robust estimator.

Evaluate Reward Model Quality

This cell evaluates each reward model on the held-out logged actions.

The most relevant metrics are probability metrics, not classification accuracy. AUC and average precision measure ranking quality. Log loss and Brier score measure probability quality. Since direct method and DR use predicted probabilities, calibration matters.

model_eval_rows = []
logged_predictions = {}

for model_name, model in reward_models.items():
    pred = model.predict_proba(X_eval_logged)[:, 1]
    logged_predictions[model_name] = pred
    model_eval_rows.append(
        {
            "reward_model": model_name,
            "auc": roc_auc_score(y_eval, pred),
            "average_precision": average_precision_score(y_eval, pred),
            "log_loss": log_loss(y_eval, pred, labels=[0, 1]),
            "brier_score": brier_score_loss(y_eval, pred),
            "mean_prediction": pred.mean(),
            "observed_click_rate": y_eval.mean(),
        }
    )

reward_model_metrics = pd.DataFrame(model_eval_rows).sort_values("log_loss")
reward_model_metrics
reward_model auc average_precision log_loss brier_score mean_prediction observed_click_rate
0 logistic 0.539928 0.006395 0.032528 0.005081 0.005774 0.004980
1 lightgbm 0.533007 0.005812 0.034352 0.005104 0.004888 0.004980

The reward-model diagnostics show whether predicted click probabilities align with observed clicks. Calibration matters because DR uses these predictions as counterfactual baselines before applying residual correction.

Plot Reward Model Calibration

Calibration checks whether predicted probabilities line up with observed click rates. This cell groups predictions into deciles and compares mean predicted click probability with actual click rate.

A perfectly calibrated model would lie near the diagonal. In sparse click data, calibration curves can be noisy, but large systematic gaps are important because direct method estimates depend heavily on probability calibration.

calibration_frames = []
for model_name, pred in logged_predictions.items():
    calibration_frame = pd.DataFrame({"prediction": pred, "click": y_eval.to_numpy()})
    calibration_frame["prediction_bin"] = pd.qcut(
        calibration_frame["prediction"].rank(method="first"), q=10, labels=False
    )
    calibration_summary = (
        calibration_frame.groupby("prediction_bin")
        .agg(mean_prediction=("prediction", "mean"), observed_click_rate=("click", "mean"), rows=("click", "size"))
        .reset_index()
    )
    calibration_summary["reward_model"] = model_name
    calibration_frames.append(calibration_summary)

calibration_df = pd.concat(calibration_frames, ignore_index=True)

fig, ax = plt.subplots(figsize=(6, 5))
sns.lineplot(
    data=calibration_df,
    x="mean_prediction",
    y="observed_click_rate",
    hue="reward_model",
    marker="o",
    ax=ax,
)
max_axis = max(calibration_df["mean_prediction"].max(), calibration_df["observed_click_rate"].max()) * 1.1
ax.plot([0, max_axis], [0, max_axis], color="black", linestyle="--", linewidth=1)
ax.set_title("Reward Model Calibration By Prediction Decile")
ax.set_xlabel("Mean Predicted Click Probability")
ax.set_ylabel("Observed Click Rate")
ax.xaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
plt.tight_layout()
plt.show()

calibration_df

prediction_bin mean_prediction observed_click_rate rows reward_model
0 0 0.001292 0.004000 10000 logistic
1 1 0.001987 0.004200 10000 logistic
2 2 0.002585 0.004200 10000 logistic
3 3 0.003192 0.006300 10000 logistic
4 4 0.003855 0.003600 10000 logistic
5 5 0.004663 0.006100 10000 logistic
6 6 0.005678 0.003600 10000 logistic
7 7 0.007076 0.005000 10000 logistic
8 8 0.009268 0.004400 10000 logistic
9 9 0.018137 0.008400 10000 logistic
10 0 0.000269 0.002900 10000 lightgbm
11 1 0.000718 0.004300 10000 lightgbm
12 2 0.001127 0.004600 10000 lightgbm
13 3 0.001629 0.005700 10000 lightgbm
14 4 0.002242 0.005800 10000 lightgbm
15 5 0.002954 0.006100 10000 lightgbm
16 6 0.003809 0.004100 10000 lightgbm
17 7 0.005085 0.005000 10000 lightgbm
18 8 0.007362 0.004200 10000 lightgbm
19 9 0.023685 0.007100 10000 lightgbm

The decile table sharpens the picture: both models track observed click rates in the middle of the score range but over-predict in the top decile (logistic 1.81% predicted vs 0.84% observed, LightGBM 2.37% vs 0.71%). DR's residual correction exists precisely to absorb this kind of systematic gap.

Recreate Evaluation Policies From Notebook 3

This cell rebuilds the same four evaluation policies used in Notebook 3:

  • uniform
  • exposure_popularity
  • ctr_weighted
  • epsilon_greedy_top_ctr

Keeping the policies identical allows us to compare IPS, SNIPS, direct method, and DR estimates without changing the target estimand.

SMOOTHING_ALPHA = 50
train_global_ctr = train_df["click"].mean()

item_stats = (
    train_df.groupby("item_id")
    .agg(train_impressions=("click", "size"), train_clicks=("click", "sum"), train_ctr=("click", "mean"))
    .reindex(action_space, fill_value=0)
    .rename_axis("item_id")
    .reset_index()
)
item_stats["smoothed_ctr"] = (
    item_stats["train_clicks"] + SMOOTHING_ALPHA * train_global_ctr
) / (item_stats["train_impressions"] + SMOOTHING_ALPHA)
item_stats["train_exposure_share"] = item_stats["train_impressions"] / item_stats["train_impressions"].sum()


def normalize_probabilities(values):
    values = np.asarray(values, dtype=float)
    values = np.clip(values, 0, None)
    total = values.sum()
    if total <= 0:
        raise ValueError("Policy scores must have positive total mass.")
    return values / total

uniform_probs = np.full(n_actions, 1 / n_actions)
exposure_popularity_probs = normalize_probabilities(item_stats["train_exposure_share"].to_numpy())
ctr_weighted_probs = normalize_probabilities(item_stats["smoothed_ctr"].to_numpy())

epsilon = 0.15
epsilon_greedy_probs = np.full(n_actions, epsilon / n_actions)
top_ctr_index = int(item_stats["smoothed_ctr"].to_numpy().argmax())
epsilon_greedy_probs[top_ctr_index] += 1 - epsilon

policy_cols = ["uniform", "exposure_popularity", "ctr_weighted", "epsilon_greedy_top_ctr"]
policy_probability_df = pd.DataFrame(
    {
        "item_id": action_space,
        "uniform": uniform_probs,
        "exposure_popularity": exposure_popularity_probs,
        "ctr_weighted": ctr_weighted_probs,
        "epsilon_greedy_top_ctr": epsilon_greedy_probs,
    }
)

policy_probability_df.head()
item_id uniform exposure_popularity ctr_weighted epsilon_greedy_top_ctr
0 0 0.029412 0.029200 0.040498 0.004412
1 1 0.029412 0.027590 0.025514 0.004412
2 2 0.029412 0.032070 0.036929 0.004412
3 3 0.029412 0.029650 0.020188 0.004412
4 4 0.029412 0.028410 0.017318 0.004412

The head of the table already shows the policies' different characters: uniform spreads mass evenly, exposure_popularity stays close to uniform, ctr_weighted tilts toward high-CTR items, and epsilon_greedy_top_ctr concentrates most of its mass on a single item.

Audit Evaluation Policy Probabilities

Before using policy probabilities in an estimator, we check that each policy is valid.

A policy must assign probabilities that sum to 1 and must not assign negative probabilities. In this notebook, every candidate policy also gives every action positive probability, which keeps support diagnostics straightforward.

policy_audit = pd.DataFrame(
    [
        {
            "policy": policy,
            "probability_sum": policy_probability_df[policy].sum(),
            "min_probability": policy_probability_df[policy].min(),
            "max_probability": policy_probability_df[policy].max(),
            "positive_actions": int((policy_probability_df[policy] > 0).sum()),
        }
        for policy in policy_cols
    ]
)

policy_audit
policy probability_sum min_probability max_probability positive_actions
0 uniform 1.000000 0.029412 0.029412 34
1 exposure_popularity 1.000000 0.025810 0.033300 34
2 ctr_weighted 1.000000 0.009558 0.080613 34
3 epsilon_greedy_top_ctr 1.000000 0.004412 0.854412 34

The audit confirms that every policy is a valid distribution: probabilities sum to 1, nothing is negative, and all 34 actions receive positive probability, so full-support importance weighting is well defined.

Create Candidate-Action Feature Frames

The direct method and DR estimator need reward predictions for every candidate action in every evaluation context.

This helper builds a feature frame for all (context, candidate item) pairs in a batch. For each context row, it repeats the context features once per action, attaches candidate item metadata, and selects the affinity score for that candidate item.

Batching keeps memory usage reasonable. The full evaluation split has 100,000 contexts and 34 actions, so scoring everything at once would create 3.4 million rows.

def make_candidate_feature_frame(context_df):
    """Create model features for every candidate action in each context row."""
    n_contexts = len(context_df)
    tiled_actions = np.tile(action_space, n_contexts)

    frame = pd.DataFrame(
        {
            "position": np.repeat(context_df["position"].to_numpy(), n_actions),
            "hour": np.repeat(context_df["hour"].to_numpy(), n_actions),
            "item_id": tiled_actions,
        }
    )

    for col in user_feature_cols:
        frame[col] = np.repeat(context_df[col].to_numpy(), n_actions)

    affinity_matrix = context_df[affinity_cols_by_action].to_numpy()
    frame["selected_affinity"] = affinity_matrix.reshape(-1)

    repeated_item_context = item_context.loc[tiled_actions, item_feature_cols].reset_index(drop=True)
    frame = pd.concat([frame, repeated_item_context], axis=1)

    return frame[feature_cols]

candidate_preview = make_candidate_feature_frame(eval_df.head(2))
candidate_preview
position hour item_id user_feature_0 user_feature_1 user_feature_2 user_feature_3 item_feature_1 item_feature_2 item_feature_3 selected_affinity item_feature_0
0 2 10 0 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
1 2 10 1 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.720300
2 2 10 2 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.745662
3 2 10 3 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
4 2 10 4 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
5 2 10 5 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe c43671ed6855a6fe2e2a6030cba64366 bbf748c6c978938bc63d432efa60191c 0.000000 0.142031
6 2 10 6 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 7082af732502f0981a9fe77d7ba1ae8a b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
7 2 10 7 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 2.858372
8 2 10 8 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.349294
9 2 10 9 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
10 2 10 10 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.586435
11 2 10 11 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 eddad9910a6d2f61905f408d4df575c5 de8b129010093b09b24a05592bfd8843 0.000000 0.443847
12 2 10 12 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
13 2 10 13 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.616313
14 2 10 14 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -1.000557
15 2 10 15 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 9f9ff361c09f765650f1c43ef7adac86 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.375367
16 2 10 16 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.590950
17 2 10 17 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
18 2 10 18 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.914324
19 2 10 19 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.763416
20 2 10 20 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.612508
21 2 10 21 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
22 2 10 22 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.698741
23 2 10 23 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
24 2 10 24 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.422288
25 2 10 25 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
26 2 10 26 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 a86ead010f033dbc2854c6a46f4fe7a7 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.896570
27 2 10 27 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.849649
28 2 10 28 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -1.065232
29 2 10 29 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.849649
30 2 10 30 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324
31 2 10 31 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 55fe518d85813954c7d9b8a875ff2453 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
32 2 10 32 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.526275
33 2 10 33 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.612508
34 2 10 0 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
35 2 10 1 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.720300
36 2 10 2 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.745662
37 2 10 3 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
38 2 10 4 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
39 2 10 5 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe c43671ed6855a6fe2e2a6030cba64366 bbf748c6c978938bc63d432efa60191c 0.000000 0.142031
40 2 10 6 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 7082af732502f0981a9fe77d7ba1ae8a b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
41 2 10 7 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 2.858372
42 2 10 8 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.349294
43 2 10 9 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
44 2 10 10 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.586435
45 2 10 11 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 eddad9910a6d2f61905f408d4df575c5 de8b129010093b09b24a05592bfd8843 0.000000 0.443847
46 2 10 12 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
47 2 10 13 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.616313
48 2 10 14 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -1.000557
49 2 10 15 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 9f9ff361c09f765650f1c43ef7adac86 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.375367
50 2 10 16 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.590950
51 2 10 17 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
52 2 10 18 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.914324
53 2 10 19 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.763416
54 2 10 20 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.612508
55 2 10 21 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
56 2 10 22 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.698741
57 2 10 23 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
58 2 10 24 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.422288
59 2 10 25 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
60 2 10 26 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 a86ead010f033dbc2854c6a46f4fe7a7 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.896570
61 2 10 27 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.849649
62 2 10 28 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -1.065232
63 2 10 29 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.849649
64 2 10 30 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324
65 2 10 31 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 55fe518d85813954c7d9b8a875ff2453 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
66 2 10 32 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.526275
67 2 10 33 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.612508

The preview confirms the construction: for each context, all 34 candidate items appear with their own metadata, and the affinity column is switched to match the candidate item rather than the logged one.

Compute Direct-Method Components In Batches

This helper scores every evaluation context against every candidate item for one reward model.

For each context, it computes:

sum_a pi_e(a|x) * q_hat(x, a)

Because our evaluation policies are context-free, the policy probabilities do not vary by row. The reward predictions still vary by row because user features, position, hour, and selected affinity vary by context.

def compute_direct_components(model, contexts_df, batch_size=10_000):
    """Return one direct-method component per context and policy."""
    policy_probability_matrix = policy_probability_df[policy_cols].to_numpy()
    component_batches = []

    for start in range(0, len(contexts_df), batch_size):
        context_batch = contexts_df.iloc[start : start + batch_size]
        candidate_features = make_candidate_feature_frame(context_batch)
        q_hat = model.predict_proba(candidate_features)[:, 1].reshape(len(context_batch), n_actions)
        direct_components = q_hat @ policy_probability_matrix
        component_batches.append(direct_components)

    components = np.vstack(component_batches)
    return pd.DataFrame(components, columns=policy_cols, index=contexts_df.index)

These cells score candidate actions under the reward model, which is what lets the direct method estimate values for policies that choose actions different from the logged one. This is the model-based complement to importance weighting.

Score Logged Actions And Candidate Actions

This cell produces the two reward-model prediction objects needed for DR:

  • q_hat_logged: predicted click probability for the item that was actually logged
  • direct_components: expected predicted reward under each evaluation policy for each context

The direct components are the model-based policy values before residual correction.

q_logged_by_model = {}
direct_components_by_model = {}

for model_name, model in reward_models.items():
    q_logged_by_model[model_name] = model.predict_proba(X_eval_logged)[:, 1]
    direct_components_by_model[model_name] = compute_direct_components(model, eval_df, batch_size=10_000)

pd.DataFrame(
    {
        "reward_model": list(q_logged_by_model),
        "mean_q_logged": [pred.mean() for pred in q_logged_by_model.values()],
        "direct_component_rows": [len(direct_components_by_model[name]) for name in q_logged_by_model],
    }
)
reward_model mean_q_logged direct_component_rows
0 logistic 0.005774 100000
1 lightgbm 0.004888 100000

The mean logged predictions match the earlier reward-model diagnostics, and each direct-component frame covers all 100,000 evaluation contexts, so the model-based inputs for DR are complete.

Attach Evaluation Policy Weights

This cell computes pi_e(A|X) / pi_b(A|X) for the logged action under each evaluation policy.

These are the same importance weights used in Notebook 3. In the doubly robust estimator, the weights multiply the reward-model residual Y - q_hat(X, A) rather than the raw reward Y.

eval_scored = eval_df[["timestamp", "item_id", "position", "click", "propensity_score"]].copy()

for policy in policy_cols:
    probability_map = policy_probability_df.set_index("item_id")[policy]
    eval_scored[f"pi_e_{policy}"] = eval_scored["item_id"].map(probability_map)
    eval_scored[f"w_{policy}"] = eval_scored[f"pi_e_{policy}"] / eval_scored["propensity_score"]

weight_check = pd.DataFrame(
    [
        {
            "policy": policy,
            "mean_weight": eval_scored[f"w_{policy}"].mean(),
            "max_weight": eval_scored[f"w_{policy}"].max(),
            "p99_weight": np.percentile(eval_scored[f"w_{policy}"], 99),
        }
        for policy in policy_cols
    ]
)

weight_check
policy mean_weight max_weight p99_weight
0 uniform 1.000000 1.000000 1.000000
1 exposure_popularity 1.000591 1.132200 1.132200
2 ctr_weighted 0.996987 2.740842 2.740842
3 epsilon_greedy_top_ctr 1.001105 29.050000 29.050000

Attaching evaluation-policy probabilities to logged rows creates the numerator of the importance weight. Once this is joined to the behavior propensity, IPS and SNIPS can estimate each policy’s value.

Define OPE Estimator Helpers

This cell defines helper functions for IPS, SNIPS, direct method, and doubly robust estimation.

Each estimator returns an estimate, approximate standard error, confidence interval, and weight diagnostics where relevant. The confidence intervals use large-sample standard errors. They are useful for comparison, but later production work would likely use richer bootstrap or repeated-split diagnostics.

def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


def summarize_signal(signal):
    """Return the mean, large-sample SE, and 95% normal-approximation CI."""
    signal = np.asarray(signal, dtype=float)
    estimate = signal.mean()
    se = signal.std(ddof=1) / np.sqrt(len(signal))
    return estimate, se, estimate - 1.96 * se, estimate + 1.96 * se


def estimate_ips_snips(reward, weight):
    """Compute IPS and SNIPS estimates plus weight diagnostics for one policy."""
    reward = np.asarray(reward, dtype=float)
    weight = np.asarray(weight, dtype=float)
    ips_signal = weight * reward
    ips, ips_se, ips_lower, ips_upper = summarize_signal(ips_signal)

    # SNIPS normalizes by the realized weight total; its SE uses the
    # influence function of the ratio estimator.
    snips = ips_signal.sum() / weight.sum()
    snips_influence = weight * (reward - snips) / weight.mean()
    snips_se = snips_influence.std(ddof=1) / np.sqrt(len(snips_influence))

    return {
        "ips": (ips, ips_se, ips_lower, ips_upper),
        "snips": (snips, snips_se, snips - 1.96 * snips_se, snips + 1.96 * snips_se),
        "ess_share": effective_sample_size(weight) / len(weight),
        "mean_weight": weight.mean(),
        "max_weight": weight.max(),
    }


def estimate_dm_dr(reward, weight, q_logged, direct_component):
    """Compute direct method and doubly robust estimates for one policy."""
    reward = np.asarray(reward, dtype=float)
    weight = np.asarray(weight, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    direct_component = np.asarray(direct_component, dtype=float)

    dm, dm_se, dm_lower, dm_upper = summarize_signal(direct_component)
    # DR per-row signal: model-based value plus importance-weighted residual.
    dr_signal = direct_component + weight * (reward - q_logged)
    dr, dr_se, dr_lower, dr_upper = summarize_signal(dr_signal)

    correction = dr_signal - direct_component
    return {
        "dm": (dm, dm_se, dm_lower, dm_upper),
        "dr": (dr, dr_se, dr_lower, dr_upper),
        "mean_abs_correction": np.abs(correction).mean(),
        "mean_correction": correction.mean(),
    }

The helper functions encode the estimator formulas and diagnostics used repeatedly in the notebook. Defining them once keeps the later policy comparisons consistent and easier to audit.
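As a sketch of the bootstrap alternative mentioned above, the helper below computes a percentile-bootstrap interval for the mean of any per-row estimator signal, such as the DR signal direct_component + weight * (reward - q_logged). The function name, resample count, and seed are choices made here for illustration; the notebook's main tables keep the large-sample intervals.

def bootstrap_ci(signal, n_boot=1000, seed=0):
    """Sketch only: percentile-bootstrap 95% CI for the mean of a per-row signal."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    # Resample rows with replacement and recompute the mean each time.
    boot_means = np.array(
        [rng.choice(signal, size=len(signal), replace=True).mean() for _ in range(n_boot)]
    )
    return np.percentile(boot_means, [2.5, 97.5])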

Estimate IPS, SNIPS, Direct Method, And DR

This cell computes the main OPE result table.

For each policy, IPS and SNIPS depend only on propensities and observed rewards. Direct method and DR are computed separately for each reward model. The DR estimate uses the reward model prediction plus an importance-weighted residual correction.

result_rows = []
reward = eval_scored["click"].to_numpy()

for policy in policy_cols:
    weight = eval_scored[f"w_{policy}"].to_numpy()
    ips_snips = estimate_ips_snips(reward, weight)

    for estimator_name, values in [("IPS", ips_snips["ips"]), ("SNIPS", ips_snips["snips"])]:
        estimate, se, lower, upper = values
        result_rows.append(
            {
                "policy": policy,
                "estimator": estimator_name,
                "reward_model": "none",
                "estimate": estimate,
                "se": se,
                "ci_95_lower": lower,
                "ci_95_upper": upper,
                "ess_share": ips_snips["ess_share"],
                "mean_weight": ips_snips["mean_weight"],
                "max_weight": ips_snips["max_weight"],
                "mean_abs_correction": np.nan,
                "mean_correction": np.nan,
            }
        )

    for model_name in reward_models:
        dm_dr = estimate_dm_dr(
            reward=reward,
            weight=weight,
            q_logged=q_logged_by_model[model_name],
            direct_component=direct_components_by_model[model_name][policy],
        )
        for estimator_name, values in [("DM", dm_dr["dm"]), ("DR", dm_dr["dr"])]:
            estimate, se, lower, upper = values
            result_rows.append(
                {
                    "policy": policy,
                    "estimator": estimator_name,
                    "reward_model": model_name,
                    "estimate": estimate,
                    "se": se,
                    "ci_95_lower": lower,
                    "ci_95_upper": upper,
                    "ess_share": ips_snips["ess_share"],
                    "mean_weight": ips_snips["mean_weight"],
                    "max_weight": ips_snips["max_weight"],
                    "mean_abs_correction": dm_dr["mean_abs_correction"],
                    "mean_correction": dm_dr["mean_correction"],
                }
            )

all_estimates = pd.DataFrame(result_rows)
all_estimates.head(12)
policy estimator reward_model estimate se ci_95_lower ci_95_upper ess_share mean_weight max_weight mean_abs_correction mean_correction
0 uniform IPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN
1 uniform SNIPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN
2 uniform DM logistic 0.005786 0.000011 0.005764 0.005807 1.000000 1.000000 1.000000 0.010685 -0.000794
3 uniform DR logistic 0.004992 0.000225 0.004550 0.005434 1.000000 1.000000 1.000000 0.010685 -0.000794
4 uniform DM lightgbm 0.004943 0.000013 0.004918 0.004968 1.000000 1.000000 1.000000 0.009813 0.000092
5 uniform DR lightgbm 0.005035 0.000226 0.004592 0.005477 1.000000 1.000000 1.000000 0.009813 0.000092
6 exposure_popularity IPS none 0.004971 0.000223 0.004535 0.005407 0.996399 1.000591 1.132200 NaN NaN
7 exposure_popularity SNIPS none 0.004968 0.000223 0.004532 0.005404 0.996399 1.000591 1.132200 NaN NaN
8 exposure_popularity DM logistic 0.005733 0.000011 0.005711 0.005754 0.996399 1.000591 1.132200 0.010630 -0.000756
9 exposure_popularity DR logistic 0.004977 0.000225 0.004535 0.005418 0.996399 1.000591 1.132200 0.010630 -0.000756
10 exposure_popularity DM lightgbm 0.004913 0.000013 0.004888 0.004938 0.996399 1.000591 1.132200 0.009779 0.000108
11 exposure_popularity DR lightgbm 0.005021 0.000226 0.004578 0.005463 0.996399 1.000591 1.132200 0.009779 0.000108

This table compares the main OPE estimator families. DR is especially useful because it combines reward-model predictions with importance-weighted residuals, reducing reliance on either component alone.
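For a compact side-by-side reading, the table can be pivoted so each policy is one row and each estimator/reward-model pair is one column. This is a sketch of an optional reading aid; agreement_view is a name introduced here for illustration.

# Hypothetical reading aid: one row per policy, one column per estimator/model pair.
agreement_view = all_estimates.pivot_table(
    index="policy",
    columns=["estimator", "reward_model"],
    values="estimate",
)
agreement_view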

Add The Observed Random-Policy Baseline

The held-out observed click rate is the on-policy value estimate for the random behavior policy. It is not counterfactual, but it gives the table a concrete baseline.

Policy lift in later cells is measured relative to this observed random-policy click rate.

behavior_value = eval_scored["click"].mean()
behavior_se = eval_scored["click"].std(ddof=1) / np.sqrt(len(eval_scored))

observed_baseline = pd.DataFrame(
    [
        {
            "policy": "behavior_random_observed",
            "estimator": "Observed",
            "reward_model": "none",
            "estimate": behavior_value,
            "se": behavior_se,
            "ci_95_lower": behavior_value - 1.96 * behavior_se,
            "ci_95_upper": behavior_value + 1.96 * behavior_se,
            "ess_share": 1.0,
            "mean_weight": 1.0,
            "max_weight": 1.0,
            "mean_abs_correction": np.nan,
            "mean_correction": np.nan,
        }
    ]
)

estimate_table = pd.concat([observed_baseline, all_estimates], ignore_index=True)
estimate_table["lift_pp"] = 100 * (estimate_table["estimate"] - behavior_value)
estimate_table["relative_lift_pct"] = 100 * (estimate_table["estimate"] / behavior_value - 1)

estimate_table.sort_values(["policy", "estimator", "reward_model"]).head(20)
policy estimator reward_model estimate se ci_95_lower ci_95_upper ess_share mean_weight max_weight mean_abs_correction mean_correction lift_pp relative_lift_pct
0 behavior_random_observed Observed none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN 0.000000 0.000000
17 ctr_weighted DM lightgbm 0.006021 0.000016 0.005989 0.006052 0.790628 0.996987 2.740842 0.011017 -0.000741 0.104055 20.894635
15 ctr_weighted DM logistic 0.007355 0.000014 0.007328 0.007382 0.790628 0.996987 2.740842 0.012451 -0.002195 0.237523 47.695402
18 ctr_weighted DR lightgbm 0.005280 0.000267 0.004756 0.005803 0.790628 0.996987 2.740842 0.011017 -0.000741 0.029956 6.015257
16 ctr_weighted DR logistic 0.005160 0.000268 0.004634 0.005687 0.790628 0.996987 2.740842 0.012451 -0.002195 0.018048 3.624090
13 ctr_weighted IPS none 0.005172 0.000261 0.004660 0.005685 0.790628 0.996987 2.740842 NaN NaN 0.019244 3.864167
14 ctr_weighted SNIPS none 0.005188 0.000262 0.004675 0.005701 0.790628 0.996987 2.740842 NaN NaN 0.020806 4.178004
23 epsilon_greedy_top_ctr DM lightgbm 0.011740 0.000068 0.011607 0.011873 0.040290 1.001105 29.050000 0.017472 -0.005122 0.676002 135.743313
21 epsilon_greedy_top_ctr DM logistic 0.014628 0.000025 0.014580 0.014676 0.040290 1.001105 29.050000 0.020619 -0.008321 0.964777 193.730292
24 epsilon_greedy_top_ctr DR lightgbm 0.006618 0.001324 0.004023 0.009214 0.040290 1.001105 29.050000 0.017472 -0.005122 0.163824 32.896440
22 epsilon_greedy_top_ctr DR logistic 0.006306 0.001279 0.003799 0.008814 0.040290 1.001105 29.050000 0.020619 -0.008321 0.132637 26.633840
19 epsilon_greedy_top_ctr IPS none 0.006238 0.001267 0.003756 0.008720 0.040290 1.001105 29.050000 NaN NaN 0.125800 25.261044
20 epsilon_greedy_top_ctr SNIPS none 0.006231 0.001261 0.003759 0.008703 0.040290 1.001105 29.050000 NaN NaN 0.125111 25.122784
11 exposure_popularity DM lightgbm 0.004913 0.000013 0.004888 0.004938 0.996399 1.000591 1.132200 0.009779 0.000108 -0.006709 -1.347183
9 exposure_popularity DM logistic 0.005733 0.000011 0.005711 0.005754 0.996399 1.000591 1.132200 0.010630 -0.000756 0.075267 15.113877
12 exposure_popularity DR lightgbm 0.005021 0.000226 0.004578 0.005463 0.996399 1.000591 1.132200 0.009779 0.000108 0.004067 0.816622
10 exposure_popularity DR logistic 0.004977 0.000225 0.004535 0.005418 0.996399 1.000591 1.132200 0.010630 -0.000756 -0.000340 -0.068183
7 exposure_popularity IPS none 0.004971 0.000223 0.004535 0.005407 0.996399 1.000591 1.132200 NaN NaN -0.000904 -0.181598
8 exposure_popularity SNIPS none 0.004968 0.000223 0.004532 0.005404 0.996399 1.000591 1.132200 NaN NaN -0.001198 -0.240522
5 uniform DM lightgbm 0.004943 0.000013 0.004918 0.004968 1.000000 1.000000 1.000000 0.009813 0.000092 -0.003706 -0.744091

Adding the observed behavior-policy baseline gives a familiar reference point. Evaluation policies can now be interpreted as offline alternatives to the policy that generated the held-out log. Note the units: lift_pp is in percentage points of click rate, so the LightGBM DR estimate for epsilon_greedy_top_ctr (0.006618) sits about 0.164 percentage points above the 0.004980 baseline, a relative lift of roughly 33%.

Compare Main Estimators Visually

This plot compares IPS, SNIPS, and LightGBM DR estimates. LightGBM DR is shown because it is the strongest reward model in this notebook, while IPS and SNIPS provide continuity with Notebook 3.

The confidence intervals are approximate normal intervals (estimate plus or minus 1.96 standard errors), so treat them as rough uncertainty bands rather than exact coverage guarantees. The main purpose of the plot is to show whether DR stabilizes policy rankings and whether it agrees directionally with pure importance weighting.

plot_estimates = estimate_table[
    (estimate_table["estimator"].isin(["Observed", "IPS", "SNIPS"]))
    | ((estimate_table["estimator"] == "DR") & (estimate_table["reward_model"] == "lightgbm"))
].copy()
plot_estimates["estimator_label"] = np.where(
    plot_estimates["estimator"] == "DR",
    "DR LightGBM",
    plot_estimates["estimator"],
)
plot_estimates["lower_error"] = plot_estimates["estimate"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["estimate"]

policy_order = plot_estimates["policy"].drop_duplicates().tolist()
estimator_order = ["Observed", "IPS", "SNIPS", "DR LightGBM"]
offsets = {"Observed": 0.0, "IPS": -0.24, "SNIPS": 0.0, "DR LightGBM": 0.24}  # Observed and SNIPS share an offset but never appear for the same policy
colors = {"Observed": "#4C78A8", "IPS": "#F58518", "SNIPS": "#54A24B", "DR LightGBM": "#B279A2"}

fig, ax = plt.subplots(figsize=(12, 5))
for estimator in estimator_order:
    subset = plot_estimates[plot_estimates["estimator_label"] == estimator]
    for _, row in subset.iterrows():
        x_base = policy_order.index(row["policy"])
        x = x_base + offsets[estimator]
        ax.errorbar(
            x=x,
            y=row["estimate"],
            yerr=[[row["lower_error"]], [row["upper_error"]]],
            fmt="o",
            color=colors[estimator],
            ecolor=colors[estimator],
            capsize=4,
            linewidth=1.4,
            markersize=6,
            label=estimator if row["policy"] == subset["policy"].iloc[0] else None,
        )

ax.set_xticks(range(len(policy_order)))
ax.set_xticklabels(policy_order, rotation=25, ha="right")
ax.set_title("Policy Value Estimates: IPS, SNIPS, And DR")
ax.set_xlabel("Policy")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
ax.legend(title="Estimator")
plt.tight_layout()
plt.show()

For the high-overlap policies (uniform, exposure_popularity, ctr_weighted), the four estimators agree closely. For epsilon_greedy_top_ctr, where support is thin, the DR interval is wide but overlaps the IPS and SNIPS intervals, so the estimators tell a directionally consistent story even where precision is poor.

Direct Method Versus DR By Reward Model

This table compares direct method and DR estimates for both reward models.

The direct method is fully model-based. DR adds the residual correction. If the correction is large, it means observed outcomes disagree with the reward model in a way that matters for that policy. Large corrections are not automatically bad, but they deserve interpretation.

dm_dr_table = estimate_table[estimate_table["estimator"].isin(["DM", "DR"])].copy()
dm_dr_table = dm_dr_table[
    [
        "policy",
        "reward_model",
        "estimator",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "mean_abs_correction",
        "mean_correction",
    ]
].sort_values(["policy", "reward_model", "estimator"])

dm_dr_table
policy reward_model estimator estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct mean_abs_correction mean_correction
17 ctr_weighted lightgbm DM 0.006021 0.005989 0.006052 0.104055 20.894635 0.011017 -0.000741
18 ctr_weighted lightgbm DR 0.005280 0.004756 0.005803 0.029956 6.015257 0.011017 -0.000741
15 ctr_weighted logistic DM 0.007355 0.007328 0.007382 0.237523 47.695402 0.012451 -0.002195
16 ctr_weighted logistic DR 0.005160 0.004634 0.005687 0.018048 3.624090 0.012451 -0.002195
23 epsilon_greedy_top_ctr lightgbm DM 0.011740 0.011607 0.011873 0.676002 135.743313 0.017472 -0.005122
24 epsilon_greedy_top_ctr lightgbm DR 0.006618 0.004023 0.009214 0.163824 32.896440 0.017472 -0.005122
21 epsilon_greedy_top_ctr logistic DM 0.014628 0.014580 0.014676 0.964777 193.730292 0.020619 -0.008321
22 epsilon_greedy_top_ctr logistic DR 0.006306 0.003799 0.008814 0.132637 26.633840 0.020619 -0.008321
11 exposure_popularity lightgbm DM 0.004913 0.004888 0.004938 -0.006709 -1.347183 0.009779 0.000108
12 exposure_popularity lightgbm DR 0.005021 0.004578 0.005463 0.004067 0.816622 0.009779 0.000108
9 exposure_popularity logistic DM 0.005733 0.005711 0.005754 0.075267 15.113877 0.010630 -0.000756
10 exposure_popularity logistic DR 0.004977 0.004535 0.005418 -0.000340 -0.068183 0.010630 -0.000756
5 uniform lightgbm DM 0.004943 0.004918 0.004968 -0.003706 -0.744091 0.009813 0.000092
6 uniform lightgbm DR 0.005035 0.004592 0.005477 0.005491 1.102546 0.009813 0.000092
3 uniform logistic DM 0.005786 0.005764 0.005807 0.080553 16.175230 0.010685 -0.000794
4 uniform logistic DR 0.004992 0.004550 0.005434 0.001201 0.241083 0.010685 -0.000794

Comparing DM and DR shows how much the residual correction changes the model-only estimate. Large differences mean the observed logged rewards are correcting the reward model substantially.
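
One sanity check falls directly out of the algebra: because DR averages dm_term + correction row by row, the DR estimate must equal the DM estimate plus mean_correction. The table confirms it, for example 0.005786 + (-0.000794) ≈ 0.004992 for the uniform policy under the logistic model. A quick programmatic check against estimate_table:

dr_rows = estimate_table.query("estimator == 'DR'").set_index(["policy", "reward_model"])
dm_rows = estimate_table.query("estimator == 'DM'").set_index(["policy", "reward_model"])
identity_gap = dr_rows["estimate"] - (dm_rows["estimate"] + dr_rows["mean_correction"])
assert np.allclose(identity_gap, 0.0)  # DR = DM + mean_correction, policy by policy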

Plot Direct Method Versus DR

This plot focuses on the two estimators that use a reward model. It shows how much the DR correction moves the direct-method estimates for each reward model.

The labels combine estimator and reward model, for example DR lightgbm. This avoids overloading Seaborn with multiple grouping variables and makes the comparison easier to read.

dm_dr_plot = dm_dr_table.copy()
dm_dr_plot["estimator_model"] = dm_dr_plot["estimator"] + " " + dm_dr_plot["reward_model"]

fig, ax = plt.subplots(figsize=(12, 5))
sns.pointplot(
    data=dm_dr_plot,
    x="policy",
    y="estimate",
    hue="estimator_model",
    dodge=0.45,
    errorbar=None,
    ax=ax,
)
ax.set_title("Direct Method Versus Doubly Robust Estimates")
ax.set_xlabel("Policy")
ax.set_ylabel("Estimated Click Rate")
ax.tick_params(axis="x", rotation=25)
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
ax.legend(title="Estimator / reward model", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()

The plot makes the size of the correction visible: the DM estimates, especially from the logistic model, sit well above their DR counterparts for ctr_weighted and epsilon_greedy_top_ctr. That gap is the residual correction pulling optimistic model-only values back toward the logged outcomes.

DR Correction Diagnostics

This cell summarizes the size of the residual correction in the DR estimator. The correction term is:

(pi_e(A|X) / pi_b(A|X)) * (Y - q_hat(X, A))

A correction near zero on average means the direct method already aligns with observed residuals under that policy. A larger correction means the logged outcomes are materially changing the model-based estimate.
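
In code, the two summary columns are the signed and absolute means of that per-row term. Reusing the illustrative array names from the sketch earlier in the notebook:

corrections = weights * (clicks - q_hat_logged)
mean_correction = corrections.mean()              # signed: systematic model bias under this policy
mean_abs_correction = np.abs(corrections).mean()  # magnitude: how strongly the data pushes on the model

The cell below pulls the same two summaries for every policy and reward model from estimate_table rather than recomputing them.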

correction_diagnostics = (
    estimate_table.query("estimator == 'DR'")
    [["policy", "reward_model", "mean_correction", "mean_abs_correction", "ess_share", "max_weight"]]
    .sort_values("mean_abs_correction", ascending=False)
)

correction_diagnostics
policy reward_model mean_correction mean_abs_correction ess_share max_weight
22 epsilon_greedy_top_ctr logistic -0.008321 0.020619 0.040290 29.050000
24 epsilon_greedy_top_ctr lightgbm -0.005122 0.017472 0.040290 29.050000
16 ctr_weighted logistic -0.002195 0.012451 0.790628 2.740842
18 ctr_weighted lightgbm -0.000741 0.011017 0.790628 2.740842
4 uniform logistic -0.000794 0.010685 1.000000 1.000000
10 exposure_popularity logistic -0.000756 0.010630 0.996399 1.132200
6 uniform lightgbm 0.000092 0.009813 1.000000 1.000000
12 exposure_popularity lightgbm 0.000108 0.009779 0.996399 1.132200

The correction diagnostics reveal how much of the DR estimate comes from importance-weighted residuals. Large or unstable corrections point to support issues, reward-model misspecification, or both.

LightGBM Feature Importance

This cell extracts the top LightGBM feature importances after preprocessing.

Feature importance is not a causal explanation. It only tells us which transformed features the reward model used for prediction. Still, it is useful for debugging and storytelling: a sensible model should use item identity, position, user features, item metadata, or affinity signals rather than random artifacts.

lightgbm_pipeline = reward_models["lightgbm"]
feature_names = lightgbm_pipeline.named_steps["preprocess"].get_feature_names_out()
feature_importances = lightgbm_pipeline.named_steps["model"].feature_importances_

importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": feature_importances})
    .sort_values("importance", ascending=False)
    .head(20)
)

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=importance_df.sort_values("importance"), x="importance", y="feature", ax=ax, color="#4C78A8")
ax.set_title("Top LightGBM Reward Model Features")
ax.set_xlabel("Feature Importance")
ax.set_ylabel("Feature")
plt.tight_layout()
plt.show()

importance_df

feature importance
119 numeric__item_feature_0 350
0 categorical__position_1 236
87 categorical__user_feature_3_9bde591ffaab8d54c4... 174
1 categorical__position_2 173
2 categorical__position_3 171
88 categorical__user_feature_3_c39b0c7dd5d4eb9a18... 139
118 numeric__selected_affinity 136
61 categorical__user_feature_0_1a2b2ad3a7f218a0d7... 122
77 categorical__user_feature_2_9b2d331c329ceb74d3... 120
74 categorical__user_feature_2_719dab53a7560218a9... 119
12 categorical__hour_9 116
4 categorical__hour_1 116
18 categorical__hour_15 114
15 categorical__hour_12 112
71 categorical__user_feature_2_2723d2eb8bba04e036... 104
65 categorical__user_feature_1_03a5648a76832f83c8... 103
17 categorical__hour_14 100
3 categorical__hour_0 96
82 categorical__user_feature_3_06128286bcc64b6a4b... 93
90 categorical__user_feature_3_f97571b9c14a786aab... 92

Feature importance helps explain what the reward model uses to predict clicks. These importances support model interpretation, but they should not be read as causal effects of the features themselves.
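
One caveat when reading this chart: LGBMClassifier's feature_importances_ reports split counts by default (how often a feature is used), which tends to flatter high-cardinality one-hot blocks. Gain-based importance, available from the underlying booster, often ranks features differently and is a cheap cross-check:

# Gain-based importance as a cross-check; rankings can differ from split counts.
booster = lightgbm_pipeline.named_steps["model"].booster_
gain_df = (
    pd.DataFrame({"feature": feature_names, "gain": booster.feature_importance(importance_type="gain")})
    .sort_values("gain", ascending=False)
    .head(20)
)
gain_df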

Weight And Model Diagnostics Together

This cell brings together the two main sources of uncertainty:

  • importance-weight stability
  • reward-model quality

A good DR estimate should have reasonable support and a reward model that is at least directionally predictive. Weak weights plus a weak reward model would make any offline policy conclusion fragile.

lightgbm_metric_row = reward_model_metrics.query("reward_model == 'lightgbm'").iloc[0]
combined_diagnostics = (
    estimate_table.query("estimator == 'DR' and reward_model == 'lightgbm'")
    [["policy", "estimate", "lift_pp", "ess_share", "mean_weight", "max_weight", "mean_abs_correction"]]
    .assign(
        reward_model_auc=lightgbm_metric_row["auc"],
        reward_model_log_loss=lightgbm_metric_row["log_loss"],
        reward_model_brier=lightgbm_metric_row["brier_score"],
    )
    .sort_values("estimate", ascending=False)
)

combined_diagnostics
policy estimate lift_pp ess_share mean_weight max_weight mean_abs_correction reward_model_auc reward_model_log_loss reward_model_brier
24 epsilon_greedy_top_ctr 0.006618 0.163824 0.040290 1.001105 29.050000 0.017472 0.533007 0.034352 0.005104
18 ctr_weighted 0.005280 0.029956 0.790628 0.996987 2.740842 0.011017 0.533007 0.034352 0.005104
6 uniform 0.005035 0.005491 1.000000 1.000000 1.000000 0.009813 0.533007 0.034352 0.005104
12 exposure_popularity 0.005021 0.004067 0.996399 1.000591 1.132200 0.009779 0.533007 0.034352 0.005104

Combining weight and model diagnostics gives a fuller risk picture than policy value alone. A policy should look good not only on estimated reward, but also on support, ESS, and model quality.
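
ess_share also supports a rough precision heuristic: under importance weighting, the standard error scales roughly like 1 / sqrt(n * ess_share), so epsilon_greedy_top_ctr's ess_share of about 0.04 implies an interval roughly five times wider than a full-support policy's, consistent with the SE columns earlier in the notebook. This is a heuristic, not an exact variance formula:

# Approximate SE inflation relative to a full-support policy (ess_share = 1).
combined_diagnostics.assign(
    se_inflation_approx=lambda df: 1.0 / np.sqrt(df["ess_share"])
)[["policy", "ess_share", "se_inflation_approx"]]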

Main Notebook Result Table

This final result table is the compact version to reference in a portfolio writeup.

It focuses on LightGBM DR because that is the strongest reward-model version in this notebook, while still retaining IPS and SNIPS as benchmarks from the previous notebook. The best policy should be interpreted alongside ESS, confidence intervals, and the observational limitations of logged bandit data.

main_result_table = estimate_table[
    (estimate_table["estimator"].isin(["Observed", "IPS", "SNIPS"]))
    | ((estimate_table["estimator"] == "DR") & (estimate_table["reward_model"] == "lightgbm"))
].copy()
main_result_table = main_result_table[
    [
        "policy",
        "estimator",
        "reward_model",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "ess_share",
        "mean_weight",
        "max_weight",
    ]
].sort_values(["estimate", "ess_share"], ascending=[False, False])

main_result_table
policy estimator reward_model estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct ess_share mean_weight max_weight
24 epsilon_greedy_top_ctr DR lightgbm 0.006618 0.004023 0.009214 0.163824 32.896440 0.040290 1.001105 29.050000
19 epsilon_greedy_top_ctr IPS none 0.006238 0.003756 0.008720 0.125800 25.261044 0.040290 1.001105 29.050000
20 epsilon_greedy_top_ctr SNIPS none 0.006231 0.003759 0.008703 0.125111 25.122784 0.040290 1.001105 29.050000
18 ctr_weighted DR lightgbm 0.005280 0.004756 0.005803 0.029956 6.015257 0.790628 0.996987 2.740842
14 ctr_weighted SNIPS none 0.005188 0.004675 0.005701 0.020806 4.178004 0.790628 0.996987 2.740842
13 ctr_weighted IPS none 0.005172 0.004660 0.005685 0.019244 3.864167 0.790628 0.996987 2.740842
6 uniform DR lightgbm 0.005035 0.004592 0.005477 0.005491 1.102546 1.000000 1.000000 1.000000
12 exposure_popularity DR lightgbm 0.005021 0.004578 0.005463 0.004067 0.816622 0.996399 1.000591 1.132200
1 uniform IPS none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
0 behavior_random_observed Observed none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
2 uniform SNIPS none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
7 exposure_popularity IPS none 0.004971 0.004535 0.005407 -0.000904 -0.181598 0.996399 1.000591 1.132200
8 exposure_popularity SNIPS none 0.004968 0.004532 0.005404 -0.001198 -0.240522 0.996399 1.000591 1.132200

Sorted by estimated click rate, epsilon_greedy_top_ctr leads across DR, IPS, and SNIPS, but its ess_share of roughly 0.04 means that ranking rests on a small effective sample. ctr_weighted offers a smaller estimated lift with far better support, a trade-off the sensitivity work in the next notebook should interrogate.

Notebook 4 Takeaways

This notebook introduced doubly robust OPE for the Open Bandit project.

The key ideas are:

  • IPS and SNIPS use propensities and observed rewards directly.
  • The direct method uses a reward model to predict outcomes for every candidate action.
  • Doubly robust OPE combines the direct method with an IPS-style residual correction (restated as a compact sketch after this list).
  • Reward-model diagnostics matter because DR is only useful if the model contains real signal.
  • Weight diagnostics still matter because the residual correction uses importance weights.
  • Comparing logistic DR and LightGBM DR is a useful sensitivity check.
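
For reference, the whole DR recipe fits in one self-contained function. This is a compact sketch with illustrative argument names, not the notebook's actual helper:

import numpy as np

def doubly_robust_value(pi_e_matrix, pi_e_logged, propensities, clicks, q_hat_matrix, q_hat_logged):
    # pi_e_matrix:  (n, n_actions) evaluation-policy probabilities for every action
    # pi_e_logged:  (n,) evaluation-policy probability of the logged action
    # propensities: (n,) behavior-policy probability of the logged action
    # clicks:       (n,) observed rewards
    # q_hat_matrix: (n, n_actions) reward-model predictions for every action
    # q_hat_logged: (n,) reward-model prediction for the logged action
    weights = pi_e_logged / propensities
    dm_term = (pi_e_matrix * q_hat_matrix).sum(axis=1)   # direct-method component
    correction = weights * (clicks - q_hat_logged)       # importance-weighted residuals
    per_row = dm_term + correction
    estimate = per_row.mean()
    se = per_row.std(ddof=1) / np.sqrt(len(per_row))     # normal-approximation standard error
    return estimate, estimate - 1.96 * se, estimate + 1.96 * se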

The next notebook should focus on policy comparison and sensitivity: clipping thresholds, alternate reward models, train/evaluation split sensitivity, and a polished recommendation about which policy is most credible offline.