04 Doubly Robust Off-Policy Evaluation

This notebook upgrades the analysis from pure importance weighting to doubly robust off-policy evaluation.

Notebook 3 estimated policy values with IPS and SNIPS. Those estimators are valuable because they use the known logging propensities directly, but they can become noisy when importance weights are large or concentrated. Doubly robust OPE adds a second ingredient: a reward model.

The reward model estimates:

q(x, a) = E[Y | X = x, A = a]

For this project, Y is click, X is recommendation context, and A is the recommended item. The doubly robust estimator combines two ingredients: the reward model's predicted value under the evaluation policy, and an importance-weighted correction built from the model's residuals on the logged actions.

This notebook keeps the same evaluation policies from Notebook 3 so that the comparison is clear: IPS, SNIPS, direct method, and doubly robust estimates are all answering the same policy-value question.

Why Doubly Robust OPE?

IPS estimates policy value by weighting observed rewards:

mean((pi_e(A|X) / pi_b(A|X)) * Y)

This is powerful because it does not require a reward model, but it can be high variance. If the behavior policy rarely chose an action that the evaluation policy likes, the importance weight can become large. For example, the epsilon-greedy policy defined later puts probability 0.85 on a single item that the uniform logger chose about 3% of the time, so that item's logged rows carry importance weights near 29.

The direct method takes the opposite approach. It trains a reward model q_hat(x, a) and estimates policy value by averaging predicted rewards under the evaluation policy:

mean(sum_a pi_e(a|x) * q_hat(x, a))

The direct method can be stable, but it is biased if the reward model is wrong.

The doubly robust estimator combines both:

mean(sum_a pi_e(a|x) * q_hat(x, a) + (pi_e(A|X) / pi_b(A|X)) * (Y - q_hat(X, A)))

The first term is the model-based policy value. The second term corrects the model using observed residuals on logged actions. It is called doubly robust because, under standard assumptions, it can remain consistent if either the propensities are correct or the reward model is correct.
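To make the mechanics concrete before touching real data, here is a minimal sketch of all three estimators on toy numbers. Every array below is invented for illustration; none of these values come from the log.

import numpy as np

# Toy values for illustration only (not from the log): three logged rows.
pi_e = np.array([0.8, 0.8, 0.2])        # pi_e(A|X) at the logged action
pi_b = np.array([0.5, 0.5, 0.5])        # pi_b(A|X), the logging propensity
y = np.array([1.0, 0.0, 0.0])           # observed rewards
q_logged = np.array([0.6, 0.5, 0.1])    # q_hat(X, A) at the logged action
dm_part = np.array([0.50, 0.45, 0.40])  # sum_a pi_e(a|x) * q_hat(x, a)

w = pi_e / pi_b                               # importance weights
ips = np.mean(w * y)                          # model-free, can be high variance
dm = np.mean(dm_part)                         # model-based, biased if q_hat is off
dr = np.mean(dm_part + w * (y - q_logged))    # model value + weighted residual
print(f"IPS={ips:.3f}  DM={dm:.3f}  DR={dr:.3f}")

If q_hat were exactly right, the residual term would average to zero and DR would collapse to the direct method; if the propensities are exactly right, the residual term repairs whatever bias q_hat introduces.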

Notebook Setup

This cell imports data, modeling, evaluation, and plotting libraries. It also suppresses one known LightGBM/sklearn feature-name metadata warning so the notebook output stays readable while real warnings remain visible.

We train two reward models:

  • logistic regression as a transparent baseline
  • LightGBM as a stronger nonlinear model

Using both models is useful because doubly robust OPE should not be treated as a black box. If DR estimates move sharply across reward models, that is a sensitivity warning.

from pathlib import Path
import warnings

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, log_loss, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names, but LGBMClassifier was fitted with feature names",
    category=UserWarning,
)

This cell prepares the notebook environment for doubly robust off-policy evaluation. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are ready for the OPE diagnostics that follow.

Locate The Cached Open Bandit Sample

This cell finds the repository root and loads the random-policy sample created in Notebook 1.

We use the random-policy log for the main DR estimates because Notebook 2 showed that it has broad support and stable propensities. This makes it the cleanest source for introducing a reward model and doubly robust correction.

RANDOM_SAMPLE_RELATIVE_PATH = Path("data/processed/open_bandit_random_men_sample.parquet")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / RANDOM_SAMPLE_RELATIVE_PATH).exists()
)

RANDOM_SAMPLE_PATH = PROJECT_ROOT / RANDOM_SAMPLE_RELATIVE_PATH
random_df = pd.read_parquet(RANDOM_SAMPLE_PATH).sort_values("timestamp").reset_index(drop=True)

pd.Series(
    {
        "project_root": PROJECT_ROOT,
        "random_sample_path": RANDOM_SAMPLE_PATH,
        "rows": len(random_df),
        "columns": random_df.shape[1],
        "observed_click_rate": random_df["click"].mean(),
    }
).to_frame("value")
value
project_root /home/apex/Documents/ranking_sys
random_sample_path /home/apex/Documents/ranking_sys/data/processe...
rows 200000
columns 50
observed_click_rate 0.005190

The printed paths are a reproducibility checkpoint. Once the notebook can locate the cached sample from the repository root, the rest of the analysis can run without manual path edits.

Recreate The Train And Evaluation Split

To stay aligned with Notebook 3, this notebook uses the same time-based split:

  • the first half of rows defines policies and trains reward models
  • the second half evaluates policy values

This separation matters. We should not tune a policy or reward model on the exact same rows used for final OPE without acknowledging the risk of overfitting.

SPLIT_FRACTION = 0.50
split_idx = int(len(random_df) * SPLIT_FRACTION)

train_df = random_df.iloc[:split_idx].copy()
eval_df = random_df.iloc[split_idx:].copy()

split_summary = pd.DataFrame(
    {
        "split": ["train", "evaluation"],
        "rows": [len(train_df), len(eval_df)],
        "min_timestamp": [train_df["timestamp"].min(), eval_df["timestamp"].min()],
        "max_timestamp": [train_df["timestamp"].max(), eval_df["timestamp"].max()],
        "click_rate": [train_df["click"].mean(), eval_df["click"].mean()],
    }
)

split_summary
split rows min_timestamp max_timestamp click_rate
0 train 100000 2019-11-24 00:00:03.800821+00:00 2019-11-25 10:01:18.392921+00:00 0.005400
1 evaluation 100000 2019-11-25 10:01:18.393450+00:00 2019-11-27 02:50:16.027289+00:00 0.004980

The split separates policy construction from policy evaluation. This prevents using the same rows to design a policy and evaluate it, which would make the offline result too optimistic.

Define The Action Space And Feature Groups

The action is the recommended item. In the men campaign there are 34 item actions.

For reward modeling, we build features that are available before the click outcome:

  • user categorical features
  • recommendation position and hour
  • candidate item ID
  • candidate item metadata
  • selected user-item affinity for the candidate item

The selected affinity is important. The raw log has one affinity column per item. When we score a candidate action, we use the affinity column corresponding to that candidate item.

action_space = np.array(sorted(random_df["item_id"].unique()))
n_actions = len(action_space)

user_feature_cols = [col for col in random_df.columns if col.startswith("user_feature_")]
affinity_cols_by_action = [f"user-item_affinity_{item_id}" for item_id in action_space]
item_feature_cols = [col for col in random_df.columns if col.startswith("item_feature_")]

categorical_features = [
    "position",
    "hour",
    "item_id",
    *user_feature_cols,
    "item_feature_1",
    "item_feature_2",
    "item_feature_3",
]
numeric_features = ["selected_affinity", "item_feature_0"]
feature_cols = categorical_features + numeric_features

feature_group_summary = pd.DataFrame(
    {
        "group": ["actions", "user categorical", "item metadata", "model categorical", "model numeric"],
        "count": [n_actions, len(user_feature_cols), len(item_feature_cols), len(categorical_features), len(numeric_features)],
        "columns": [list(action_space), user_feature_cols, item_feature_cols, categorical_features, numeric_features],
    }
)

feature_group_summary
group count columns
0 actions 34 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1 user categorical 4 [user_feature_0, user_feature_1, user_feature_...
2 item metadata 4 [item_feature_0, item_feature_1, item_feature_...
3 model categorical 10 [position, hour, item_id, user_feature_0, user...
4 model numeric 2 [selected_affinity, item_feature_0]

The action-space definition fixes the set of items that evaluation policies are allowed to choose. OPE estimates are only meaningful for policies whose action probabilities live inside this logged support.

Build Item Metadata Lookup

The cached random sample already contains item metadata for the logged item. To score counterfactual candidate actions, we need one metadata row per item.

This cell builds an item lookup table. Later, when we score item a for a context, we attach the metadata for item a, not the metadata for the item that happened to be logged in that row.

item_context = (
    random_df[["item_id", *item_feature_cols]]
    .drop_duplicates("item_id")
    .set_index("item_id")
    .sort_index()
)

missing_items = sorted(set(action_space) - set(item_context.index))
if missing_items:
    raise ValueError(f"Missing item metadata for item IDs: {missing_items}")

item_context.head()
item_feature_0 item_feature_1 item_feature_2 item_feature_3
item_id
0 -0.677183 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
1 -0.720300 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
2 0.745662 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
3 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d
4 1.651109 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759

These feature-building cells define the context used by reward models. Reward models need both user context and candidate item context so they can predict counterfactual rewards for actions that were not logged.

Create Logged-Action Reward Model Features

This helper creates a supervised-learning frame for rows where the logged action is known.

For each row, selected_affinity is taken from the affinity column that corresponds to the logged item_id. This makes the training data match the counterfactual scoring data we will create later: every feature row represents a specific context-action pair.

def make_logged_feature_frame(df):
    """Create reward-model features for the action that was actually logged."""
    frame = df.copy()
    affinity_matrix = frame[affinity_cols_by_action].to_numpy()
    item_positions = frame["item_id"].map({item_id: idx for idx, item_id in enumerate(action_space)}).to_numpy()
    frame["selected_affinity"] = affinity_matrix[np.arange(len(frame)), item_positions]
    return frame[feature_cols]

X_train = make_logged_feature_frame(train_df)
y_train = train_df["click"].astype(int)
X_eval_logged = make_logged_feature_frame(eval_df)
y_eval = eval_df["click"].astype(int)

X_train.head()
position hour item_id user_feature_0 user_feature_1 user_feature_2 user_feature_3 item_feature_1 item_feature_2 item_feature_3 selected_affinity item_feature_0
0 1 0 0 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
1 3 0 25 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
2 2 0 23 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
3 1 0 25 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
4 2 0 30 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324

The preview confirms that each training row pairs a context with the logged item's features, and that selected_affinity comes from the affinity column matching that item, so training rows and the later counterfactual rows share one schema.

Check Reward Model Feature Quality

This cell checks missingness and target balance for the reward-model training frame.

Reward modeling is now part of the causal estimator. If feature construction is broken, the direct method and doubly robust estimates will be misleading. This quick check catches obvious issues before model fitting.

feature_quality = pd.DataFrame(
    {
        "feature": feature_cols,
        "missing_rate_train": [X_train[col].isna().mean() for col in feature_cols],
        "missing_rate_eval": [X_eval_logged[col].isna().mean() for col in feature_cols],
        "train_unique_values": [X_train[col].nunique(dropna=False) for col in feature_cols],
    }
)

target_summary = pd.Series(
    {
        "train_rows": len(y_train),
        "eval_rows": len(y_eval),
        "train_click_rate": y_train.mean(),
        "eval_click_rate": y_eval.mean(),
    }
).to_frame("value")

feature_quality, target_summary
(              feature  missing_rate_train  missing_rate_eval  \
 0            position            0.000000           0.000000   
 1                hour            0.000000           0.000000   
 2             item_id            0.000000           0.000000   
 3      user_feature_0            0.000000           0.000000   
 4      user_feature_1            0.000000           0.000000   
 5      user_feature_2            0.000000           0.000000   
 6      user_feature_3            0.000000           0.000000   
 7      item_feature_1            0.000000           0.000000   
 8      item_feature_2            0.000000           0.000000   
 9      item_feature_3            0.000000           0.000000   
 10  selected_affinity            0.000000           0.000000   
 11     item_feature_0            0.000000           0.000000   
 
     train_unique_values  
 0                     3  
 1                    24  
 2                    34  
 3                     4  
 4                     6  
 5                    10  
 6                    10  
 7                     7  
 8                    16  
 9                     4  
 10                    4  
 11                   25  ,
                          value
 train_rows       100000.000000
 eval_rows        100000.000000
 train_click_rate      0.005400
 eval_click_rate       0.004980)

The feature-quality table checks whether the reward-model inputs are present and informative. If important context fields were missing or constant, direct method and DR estimates would lean on weak predictions.

Define The Preprocessor

This cell defines the preprocessing shared by the reward models.

Categorical features are one-hot encoded. Numeric features are scaled. The final transformed matrix is then passed to logistic regression or LightGBM.

The preprocessing is intentionally standard and auditable. For OPE, a slightly weaker but understandable reward model is often better for a portfolio notebook than a highly tuned model whose behavior is hard to explain.

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("numeric", StandardScaler(), numeric_features),
    ],
    remainder="drop",
)

Sharing one preprocessor across both reward models keeps the comparison fair: any difference between their direct method and DR estimates reflects the model class, not differences in feature handling.

Train Reward Models

This cell trains two reward models.

The logistic regression model is a baseline with a simple functional form. The LightGBM model can capture nonlinear interactions between context, item metadata, and affinity. We do not use class weighting because the model probabilities should estimate click probability, not rebalance the classification objective for a different decision threshold.

reward_models = {
    "logistic": Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("model", LogisticRegression(max_iter=500, solver="lbfgs")),
        ]
    ),
    "lightgbm": Pipeline(
        steps=[
            ("preprocess", preprocessor),
            (
                "model",
                lgb.LGBMClassifier(
                    n_estimators=200,
                    learning_rate=0.05,
                    num_leaves=31,
                    min_child_samples=100,
                    subsample=0.85,
                    colsample_bytree=0.85,
                    random_state=42,
                    verbose=-1,
                ),
            ),
        ]
    ),
}

for model_name, model in reward_models.items():
    model.fit(X_train, y_train)

list(reward_models)
['logistic', 'lightgbm']

The reward models learn expected click probability from logged action-context pairs. These predictions power the direct method and the model-based part of the doubly robust estimator.

Evaluate Reward Model Quality

This cell evaluates each reward model on the held-out logged actions.

The most relevant metrics are probability metrics, not classification accuracy. AUC and average precision measure ranking quality. Log loss and Brier score measure probability quality. Since direct method and DR use predicted probabilities, calibration matters.

model_eval_rows = []
logged_predictions = {}

for model_name, model in reward_models.items():
    pred = model.predict_proba(X_eval_logged)[:, 1]
    logged_predictions[model_name] = pred
    model_eval_rows.append(
        {
            "reward_model": model_name,
            "auc": roc_auc_score(y_eval, pred),
            "average_precision": average_precision_score(y_eval, pred),
            "log_loss": log_loss(y_eval, pred, labels=[0, 1]),
            "brier_score": brier_score_loss(y_eval, pred),
            "mean_prediction": pred.mean(),
            "observed_click_rate": y_eval.mean(),
        }
    )

reward_model_metrics = pd.DataFrame(model_eval_rows).sort_values("log_loss")
reward_model_metrics
reward_model auc average_precision log_loss brier_score mean_prediction observed_click_rate
0 logistic 0.539928 0.006395 0.032528 0.005081 0.005774 0.004980
1 lightgbm 0.533007 0.005812 0.034352 0.005104 0.004888 0.004980

The reward-model diagnostics show whether predicted click probabilities align with observed clicks. Calibration matters because DR uses these predictions as counterfactual baselines before applying residual correction.

Plot Reward Model Calibration

Calibration checks whether predicted probabilities line up with observed click rates. This cell groups predictions into deciles and compares mean predicted click probability with actual click rate.

A perfectly calibrated model would lie near the diagonal. In sparse click data, calibration curves can be noisy, but large systematic gaps are important because direct method estimates depend heavily on probability calibration.

calibration_frames = []
for model_name, pred in logged_predictions.items():
    calibration_frame = pd.DataFrame({"prediction": pred, "click": y_eval.to_numpy()})
    calibration_frame["prediction_bin"] = pd.qcut(
        calibration_frame["prediction"].rank(method="first"), q=10, labels=False
    )
    calibration_summary = (
        calibration_frame.groupby("prediction_bin")
        .agg(mean_prediction=("prediction", "mean"), observed_click_rate=("click", "mean"), rows=("click", "size"))
        .reset_index()
    )
    calibration_summary["reward_model"] = model_name
    calibration_frames.append(calibration_summary)

calibration_df = pd.concat(calibration_frames, ignore_index=True)

fig, ax = plt.subplots(figsize=(6, 5))
sns.lineplot(
    data=calibration_df,
    x="mean_prediction",
    y="observed_click_rate",
    hue="reward_model",
    marker="o",
    ax=ax,
)
max_axis = max(calibration_df["mean_prediction"].max(), calibration_df["observed_click_rate"].max()) * 1.1
ax.plot([0, max_axis], [0, max_axis], color="black", linestyle="--", linewidth=1)
ax.set_title("Reward Model Calibration By Prediction Decile")
ax.set_xlabel("Mean Predicted Click Probability")
ax.set_ylabel("Observed Click Rate")
ax.xaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
plt.tight_layout()
plt.show()

calibration_df

prediction_bin mean_prediction observed_click_rate rows reward_model
0 0 0.001292 0.004000 10000 logistic
1 1 0.001987 0.004200 10000 logistic
2 2 0.002585 0.004200 10000 logistic
3 3 0.003192 0.006300 10000 logistic
4 4 0.003855 0.003600 10000 logistic
5 5 0.004663 0.006100 10000 logistic
6 6 0.005678 0.003600 10000 logistic
7 7 0.007076 0.005000 10000 logistic
8 8 0.009268 0.004400 10000 logistic
9 9 0.018137 0.008400 10000 logistic
10 0 0.000269 0.002900 10000 lightgbm
11 1 0.000718 0.004300 10000 lightgbm
12 2 0.001127 0.004600 10000 lightgbm
13 3 0.001629 0.005700 10000 lightgbm
14 4 0.002242 0.005800 10000 lightgbm
15 5 0.002954 0.006100 10000 lightgbm
16 6 0.003809 0.004100 10000 lightgbm
17 7 0.005085 0.005000 10000 lightgbm
18 8 0.007362 0.004200 10000 lightgbm
19 9 0.023685 0.007100 10000 lightgbm

The decile table sharpens the picture: both models track observed click rates in the middle of the score range but over-predict in the top decile (logistic 1.81% predicted vs 0.84% observed, LightGBM 2.37% vs 0.71%). DR's residual correction exists precisely to absorb this kind of systematic gap.

Recreate Evaluation Policies From Notebook 3

This cell rebuilds the same four evaluation policies used in Notebook 3:

  • uniform
  • exposure_popularity
  • ctr_weighted
  • epsilon_greedy_top_ctr

Keeping the policies identical allows us to compare IPS, SNIPS, direct method, and DR estimates without changing the target estimand.

SMOOTHING_ALPHA = 50
train_global_ctr = train_df["click"].mean()

item_stats = (
    train_df.groupby("item_id")
    .agg(train_impressions=("click", "size"), train_clicks=("click", "sum"), train_ctr=("click", "mean"))
    .reindex(action_space, fill_value=0)
    .rename_axis("item_id")
    .reset_index()
)
item_stats["smoothed_ctr"] = (
    item_stats["train_clicks"] + SMOOTHING_ALPHA * train_global_ctr
) / (item_stats["train_impressions"] + SMOOTHING_ALPHA)
item_stats["train_exposure_share"] = item_stats["train_impressions"] / item_stats["train_impressions"].sum()


def normalize_probabilities(values):
    values = np.asarray(values, dtype=float)
    values = np.clip(values, 0, None)
    total = values.sum()
    if total <= 0:
        raise ValueError("Policy scores must have positive total mass.")
    return values / total

uniform_probs = np.full(n_actions, 1 / n_actions)
exposure_popularity_probs = normalize_probabilities(item_stats["train_exposure_share"].to_numpy())
ctr_weighted_probs = normalize_probabilities(item_stats["smoothed_ctr"].to_numpy())

epsilon = 0.15
epsilon_greedy_probs = np.full(n_actions, epsilon / n_actions)
top_ctr_index = int(item_stats["smoothed_ctr"].to_numpy().argmax())
epsilon_greedy_probs[top_ctr_index] += 1 - epsilon

policy_cols = ["uniform", "exposure_popularity", "ctr_weighted", "epsilon_greedy_top_ctr"]
policy_probability_df = pd.DataFrame(
    {
        "item_id": action_space,
        "uniform": uniform_probs,
        "exposure_popularity": exposure_popularity_probs,
        "ctr_weighted": ctr_weighted_probs,
        "epsilon_greedy_top_ctr": epsilon_greedy_probs,
    }
)

policy_probability_df.head()
item_id uniform exposure_popularity ctr_weighted epsilon_greedy_top_ctr
0 0 0.029412 0.029200 0.040498 0.004412
1 1 0.029412 0.027590 0.025514 0.004412
2 2 0.029412 0.032070 0.036929 0.004412
3 3 0.029412 0.029650 0.020188 0.004412
4 4 0.029412 0.028410 0.017318 0.004412

The head of the table already shows the policies' different characters: uniform spreads mass evenly, exposure_popularity stays close to uniform, ctr_weighted tilts toward high-CTR items, and epsilon_greedy_top_ctr concentrates most of its mass on a single item.

Audit Evaluation Policy Probabilities

Before using policy probabilities in an estimator, we check that each policy is valid.

A policy must assign probabilities that sum to 1 and must not assign negative probabilities. In this notebook, every candidate policy also gives every action positive probability, which keeps support diagnostics straightforward.

policy_audit = pd.DataFrame(
    [
        {
            "policy": policy,
            "probability_sum": policy_probability_df[policy].sum(),
            "min_probability": policy_probability_df[policy].min(),
            "max_probability": policy_probability_df[policy].max(),
            "positive_actions": int((policy_probability_df[policy] > 0).sum()),
        }
        for policy in policy_cols
    ]
)

policy_audit
policy probability_sum min_probability max_probability positive_actions
0 uniform 1.000000 0.029412 0.029412 34
1 exposure_popularity 1.000000 0.025810 0.033300 34
2 ctr_weighted 1.000000 0.009558 0.080613 34
3 epsilon_greedy_top_ctr 1.000000 0.004412 0.854412 34

The audit confirms that every policy is a valid distribution: probabilities sum to 1, nothing is negative, and all 34 actions receive positive probability, so full-support importance weighting is well defined.

Create Candidate-Action Feature Frames

The direct method and DR estimator need reward predictions for every candidate action in every evaluation context.

This helper builds a feature frame for all (context, candidate item) pairs in a batch. For each context row, it repeats the context features once per action, attaches candidate item metadata, and selects the affinity score for that candidate item.

Batching keeps memory usage reasonable. The full evaluation split has 100,000 contexts and 34 actions, so scoring everything at once would create 3.4 million rows.

def make_candidate_feature_frame(context_df):
    """Create model features for every candidate action in each context row."""
    n_contexts = len(context_df)
    tiled_actions = np.tile(action_space, n_contexts)

    frame = pd.DataFrame(
        {
            "position": np.repeat(context_df["position"].to_numpy(), n_actions),
            "hour": np.repeat(context_df["hour"].to_numpy(), n_actions),
            "item_id": tiled_actions,
        }
    )

    for col in user_feature_cols:
        frame[col] = np.repeat(context_df[col].to_numpy(), n_actions)

    affinity_matrix = context_df[affinity_cols_by_action].to_numpy()
    frame["selected_affinity"] = affinity_matrix.reshape(-1)

    repeated_item_context = item_context.loc[tiled_actions, item_feature_cols].reset_index(drop=True)
    frame = pd.concat([frame, repeated_item_context], axis=1)

    return frame[feature_cols]

candidate_preview = make_candidate_feature_frame(eval_df.head(2))
candidate_preview
position hour item_id user_feature_0 user_feature_1 user_feature_2 user_feature_3 item_feature_1 item_feature_2 item_feature_3 selected_affinity item_feature_0
0 2 10 0 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
1 2 10 1 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.720300
2 2 10 2 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.745662
3 2 10 3 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
4 2 10 4 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
5 2 10 5 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe c43671ed6855a6fe2e2a6030cba64366 bbf748c6c978938bc63d432efa60191c 0.000000 0.142031
6 2 10 6 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 7082af732502f0981a9fe77d7ba1ae8a b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
7 2 10 7 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 2.858372
8 2 10 8 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 01fe2f187e459e6ada960671d2942dfe d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.349294
9 2 10 9 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
10 2 10 10 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.586435
11 2 10 11 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 eddad9910a6d2f61905f408d4df575c5 de8b129010093b09b24a05592bfd8843 0.000000 0.443847
12 2 10 12 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
13 2 10 13 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.616313
14 2 10 14 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -1.000557
15 2 10 15 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 9f9ff361c09f765650f1c43ef7adac86 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.375367
16 2 10 16 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 ce58bf66d7e62186e6ce01bafeea9d39 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.590950
17 2 10 17 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
18 2 10 18 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.914324
19 2 10 19 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.763416
20 2 10 20 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.612508
21 2 10 21 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
22 2 10 22 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.698741
23 2 10 23 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
24 2 10 24 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.422288
25 2 10 25 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
26 2 10 26 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 3c2985d744e0d57c261abd7e541e4263 a86ead010f033dbc2854c6a46f4fe7a7 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.896570
27 2 10 27 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.849649
28 2 10 28 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -1.065232
29 2 10 29 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.849649
30 2 10 30 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324
31 2 10 31 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 55fe518d85813954c7d9b8a875ff2453 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
32 2 10 32 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.526275
33 2 10 33 cef3390ed299c09874189c387777674a 2d03db5543b14483e52d761760686b64 2723d2eb8bba04e0362098011fa3997b 9bde591ffaab8d54c457448e4dca6f53 e95f0d1a3591e01d7ed3f0710424e84d 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.612508
34 2 10 0 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.677183
35 2 10 1 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.720300
36 2 10 2 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.745662
37 2 10 3 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
38 2 10 4 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
39 2 10 5 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe c43671ed6855a6fe2e2a6030cba64366 bbf748c6c978938bc63d432efa60191c 0.000000 0.142031
40 2 10 6 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 7082af732502f0981a9fe77d7ba1ae8a b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.651109
41 2 10 7 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 2.858372
42 2 10 8 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 01fe2f187e459e6ada960671d2942dfe d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.349294
43 2 10 9 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
44 2 10 10 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.586435
45 2 10 11 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 eddad9910a6d2f61905f408d4df575c5 de8b129010093b09b24a05592bfd8843 0.000000 0.443847
46 2 10 12 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 1.198386
47 2 10 13 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.616313
48 2 10 14 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -1.000557
49 2 10 15 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 9f9ff361c09f765650f1c43ef7adac86 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.375367
50 2 10 16 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 ce58bf66d7e62186e6ce01bafeea9d39 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.590950
51 2 10 17 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
52 2 10 18 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.914324
53 2 10 19 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.763416
54 2 10 20 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.612508
55 2 10 21 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.698741
56 2 10 22 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.698741
57 2 10 23 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 -0.569392
58 2 10 24 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d d7f03898d040700d6e1810d21e669958 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.422288
59 2 10 25 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 9874ffb54e9b0a269e29bbb2f5328735 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
60 2 10 26 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 3c2985d744e0d57c261abd7e541e4263 a86ead010f033dbc2854c6a46f4fe7a7 b61cfaadd526b816e3aeb9b7be4b4759 0.000000 0.896570
61 2 10 27 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.849649
62 2 10 28 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -1.065232
63 2 10 29 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.849649
64 2 10 30 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 61c5d8c2524684aa047e15e172c7e92f 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c 0.000000 -0.914324
65 2 10 31 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 55fe518d85813954c7d9b8a875ff2453 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c 0.000000 -0.461600
66 2 10 32 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d b726ac74a20945400f27294febd4ab55 5bc9c86cd1f08a9991670ea97b34f86d 0.000000 -0.526275
67 2 10 33 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 9b2d331c329ceb74d3dcfb48d8798c78 f97571b9c14a786aab269f0b427d2a85 e95f0d1a3591e01d7ed3f0710424e84d 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c 0.000000 -0.612508

The preview confirms the construction: for each context, all 34 candidate items appear with their own metadata, and the affinity column is switched to match the candidate item rather than the logged one.

Compute Direct-Method Components In Batches

This helper scores every evaluation context against every candidate item for one reward model.

For each context, it computes:

sum_a pi_e(a|x) * q_hat(x, a)

Because our evaluation policies are context-free, the policy probabilities do not vary by row. The reward predictions still vary by row because user features, position, hour, and selected affinity vary by context.

def compute_direct_components(model, contexts_df, batch_size=10_000):
    """Return one direct-method component per context and policy."""
    policy_probability_matrix = policy_probability_df[policy_cols].to_numpy()
    component_batches = []

    for start in range(0, len(contexts_df), batch_size):
        context_batch = contexts_df.iloc[start : start + batch_size]
        candidate_features = make_candidate_feature_frame(context_batch)
        q_hat = model.predict_proba(candidate_features)[:, 1].reshape(len(context_batch), n_actions)
        direct_components = q_hat @ policy_probability_matrix
        component_batches.append(direct_components)

    components = np.vstack(component_batches)
    return pd.DataFrame(components, columns=policy_cols, index=contexts_df.index)

These cells score candidate actions under the reward model, which is what lets the direct method estimate values for policies that choose actions different from the logged one. This is the model-based complement to importance weighting.

Score Logged Actions And Candidate Actions

This cell produces the two reward-model prediction objects needed for DR:

  • q_hat_logged: predicted click probability for the item that was actually logged
  • direct_components: expected predicted reward under each evaluation policy for each context

The direct components are the model-based policy values before residual correction.

q_logged_by_model = {}
direct_components_by_model = {}

for model_name, model in reward_models.items():
    q_logged_by_model[model_name] = model.predict_proba(X_eval_logged)[:, 1]
    direct_components_by_model[model_name] = compute_direct_components(model, eval_df, batch_size=10_000)

pd.DataFrame(
    {
        "reward_model": list(q_logged_by_model),
        "mean_q_logged": [pred.mean() for pred in q_logged_by_model.values()],
        "direct_component_rows": [len(direct_components_by_model[name]) for name in q_logged_by_model],
    }
)
reward_model mean_q_logged direct_component_rows
0 logistic 0.005774 100000
1 lightgbm 0.004888 100000

The mean logged predictions match the earlier reward-model diagnostics, and each direct-component frame covers all 100,000 evaluation contexts, so the model-based inputs for DR are complete.

Attach Evaluation Policy Weights

This cell computes pi_e(A|X) / pi_b(A|X) for the logged action under each evaluation policy.

These are the same importance weights used in Notebook 3. In the doubly robust estimator, the weights multiply the reward-model residual Y - q_hat(X, A) rather than the raw reward Y.

eval_scored = eval_df[["timestamp", "item_id", "position", "click", "propensity_score"]].copy()

for policy in policy_cols:
    probability_map = policy_probability_df.set_index("item_id")[policy]
    eval_scored[f"pi_e_{policy}"] = eval_scored["item_id"].map(probability_map)
    eval_scored[f"w_{policy}"] = eval_scored[f"pi_e_{policy}"] / eval_scored["propensity_score"]

weight_check = pd.DataFrame(
    [
        {
            "policy": policy,
            "mean_weight": eval_scored[f"w_{policy}"].mean(),
            "max_weight": eval_scored[f"w_{policy}"].max(),
            "p99_weight": np.percentile(eval_scored[f"w_{policy}"], 99),
        }
        for policy in policy_cols
    ]
)

weight_check
policy mean_weight max_weight p99_weight
0 uniform 1.000000 1.000000 1.000000
1 exposure_popularity 1.000591 1.132200 1.132200
2 ctr_weighted 0.996987 2.740842 2.740842
3 epsilon_greedy_top_ctr 1.001105 29.050000 29.050000

Attaching evaluation-policy probabilities to logged rows creates the numerator of the importance weight. Once this is joined to the behavior propensity, IPS and SNIPS can estimate each policy’s value.

Define OPE Estimator Helpers

This cell defines helper functions for IPS, SNIPS, direct method, and doubly robust estimation.

Each estimator returns an estimate, approximate standard error, confidence interval, and weight diagnostics where relevant. The confidence intervals use large-sample standard errors. They are useful for comparison, but later production work would likely use richer bootstrap or repeated-split diagnostics.

def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


def summarize_signal(signal):
    """Return the mean, large-sample SE, and 95% normal-approximation CI."""
    signal = np.asarray(signal, dtype=float)
    estimate = signal.mean()
    se = signal.std(ddof=1) / np.sqrt(len(signal))
    return estimate, se, estimate - 1.96 * se, estimate + 1.96 * se


def estimate_ips_snips(reward, weight):
    """Compute IPS and SNIPS estimates plus weight diagnostics for one policy."""
    reward = np.asarray(reward, dtype=float)
    weight = np.asarray(weight, dtype=float)
    ips_signal = weight * reward
    ips, ips_se, ips_lower, ips_upper = summarize_signal(ips_signal)

    # SNIPS normalizes by the realized weight total; its SE uses the
    # influence function of the ratio estimator.
    snips = ips_signal.sum() / weight.sum()
    snips_influence = weight * (reward - snips) / weight.mean()
    snips_se = snips_influence.std(ddof=1) / np.sqrt(len(snips_influence))

    return {
        "ips": (ips, ips_se, ips_lower, ips_upper),
        "snips": (snips, snips_se, snips - 1.96 * snips_se, snips + 1.96 * snips_se),
        "ess_share": effective_sample_size(weight) / len(weight),
        "mean_weight": weight.mean(),
        "max_weight": weight.max(),
    }


def estimate_dm_dr(reward, weight, q_logged, direct_component):
    """Compute direct method and doubly robust estimates for one policy."""
    reward = np.asarray(reward, dtype=float)
    weight = np.asarray(weight, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    direct_component = np.asarray(direct_component, dtype=float)

    dm, dm_se, dm_lower, dm_upper = summarize_signal(direct_component)
    # DR per-row signal: model-based value plus importance-weighted residual.
    dr_signal = direct_component + weight * (reward - q_logged)
    dr, dr_se, dr_lower, dr_upper = summarize_signal(dr_signal)

    correction = dr_signal - direct_component
    return {
        "dm": (dm, dm_se, dm_lower, dm_upper),
        "dr": (dr, dr_se, dr_lower, dr_upper),
        "mean_abs_correction": np.abs(correction).mean(),
        "mean_correction": correction.mean(),
    }

The helper functions encode the estimator formulas and diagnostics used repeatedly in the notebook. Defining them once keeps the later policy comparisons consistent and easier to audit.
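As a sketch of the bootstrap alternative mentioned above, the helper below computes a percentile-bootstrap interval for the mean of any per-row estimator signal, such as the DR signal direct_component + weight * (reward - q_logged). The function name, resample count, and seed are choices made here for illustration; the notebook's main tables keep the large-sample intervals.

def bootstrap_ci(signal, n_boot=1000, seed=0):
    """Sketch only: percentile-bootstrap 95% CI for the mean of a per-row signal."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    # Resample rows with replacement and recompute the mean each time.
    boot_means = np.array(
        [rng.choice(signal, size=len(signal), replace=True).mean() for _ in range(n_boot)]
    )
    return np.percentile(boot_means, [2.5, 97.5])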

Estimate IPS, SNIPS, Direct Method, And DR

This cell computes the main OPE result table.

For each policy, IPS and SNIPS depend only on propensities and observed rewards. Direct method and DR are computed separately for each reward model. The DR estimate uses the reward model prediction plus an importance-weighted residual correction.

result_rows = []
reward = eval_scored["click"].to_numpy()

for policy in policy_cols:
    weight = eval_scored[f"w_{policy}"].to_numpy()
    ips_snips = estimate_ips_snips(reward, weight)

    for estimator_name, values in [("IPS", ips_snips["ips"]), ("SNIPS", ips_snips["snips"])]:
        estimate, se, lower, upper = values
        result_rows.append(
            {
                "policy": policy,
                "estimator": estimator_name,
                "reward_model": "none",
                "estimate": estimate,
                "se": se,
                "ci_95_lower": lower,
                "ci_95_upper": upper,
                "ess_share": ips_snips["ess_share"],
                "mean_weight": ips_snips["mean_weight"],
                "max_weight": ips_snips["max_weight"],
                "mean_abs_correction": np.nan,
                "mean_correction": np.nan,
            }
        )

    for model_name in reward_models:
        dm_dr = estimate_dm_dr(
            reward=reward,
            weight=weight,
            q_logged=q_logged_by_model[model_name],
            direct_component=direct_components_by_model[model_name][policy],
        )
        for estimator_name, values in [("DM", dm_dr["dm"]), ("DR", dm_dr["dr"])]:
            estimate, se, lower, upper = values
            result_rows.append(
                {
                    "policy": policy,
                    "estimator": estimator_name,
                    "reward_model": model_name,
                    "estimate": estimate,
                    "se": se,
                    "ci_95_lower": lower,
                    "ci_95_upper": upper,
                    "ess_share": ips_snips["ess_share"],
                    "mean_weight": ips_snips["mean_weight"],
                    "max_weight": ips_snips["max_weight"],
                    "mean_abs_correction": dm_dr["mean_abs_correction"],
                    "mean_correction": dm_dr["mean_correction"],
                }
            )

all_estimates = pd.DataFrame(result_rows)
all_estimates.head(12)
policy estimator reward_model estimate se ci_95_lower ci_95_upper ess_share mean_weight max_weight mean_abs_correction mean_correction
0 uniform IPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN
1 uniform SNIPS none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN
2 uniform DM logistic 0.005786 0.000011 0.005764 0.005807 1.000000 1.000000 1.000000 0.010685 -0.000794
3 uniform DR logistic 0.004992 0.000225 0.004550 0.005434 1.000000 1.000000 1.000000 0.010685 -0.000794
4 uniform DM lightgbm 0.004943 0.000013 0.004918 0.004968 1.000000 1.000000 1.000000 0.009813 0.000092
5 uniform DR lightgbm 0.005035 0.000226 0.004592 0.005477 1.000000 1.000000 1.000000 0.009813 0.000092
6 exposure_popularity IPS none 0.004971 0.000223 0.004535 0.005407 0.996399 1.000591 1.132200 NaN NaN
7 exposure_popularity SNIPS none 0.004968 0.000223 0.004532 0.005404 0.996399 1.000591 1.132200 NaN NaN
8 exposure_popularity DM logistic 0.005733 0.000011 0.005711 0.005754 0.996399 1.000591 1.132200 0.010630 -0.000756
9 exposure_popularity DR logistic 0.004977 0.000225 0.004535 0.005418 0.996399 1.000591 1.132200 0.010630 -0.000756
10 exposure_popularity DM lightgbm 0.004913 0.000013 0.004888 0.004938 0.996399 1.000591 1.132200 0.009779 0.000108
11 exposure_popularity DR lightgbm 0.005021 0.000226 0.004578 0.005463 0.996399 1.000591 1.132200 0.009779 0.000108

This table compares the main OPE estimator families. DR is especially useful because it combines reward-model predictions with importance-weighted residuals, reducing reliance on either component alone.
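For a compact side-by-side reading, the table can be pivoted so each policy is one row and each estimator/reward-model pair is one column. This is a sketch of an optional reading aid; agreement_view is a name introduced here for illustration.

# Hypothetical reading aid: one row per policy, one column per estimator/model pair.
agreement_view = all_estimates.pivot_table(
    index="policy",
    columns=["estimator", "reward_model"],
    values="estimate",
)
agreement_view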

Add The Observed Random-Policy Baseline

The held-out observed click rate is the on-policy value estimate for the random behavior policy. It is not counterfactual, but it gives the table a concrete baseline.

Policy lift in later cells is measured relative to this observed random-policy click rate.

behavior_value = eval_scored["click"].mean()
behavior_se = eval_scored["click"].std(ddof=1) / np.sqrt(len(eval_scored))

observed_baseline = pd.DataFrame(
    [
        {
            "policy": "behavior_random_observed",
            "estimator": "Observed",
            "reward_model": "none",
            "estimate": behavior_value,
            "se": behavior_se,
            "ci_95_lower": behavior_value - 1.96 * behavior_se,
            "ci_95_upper": behavior_value + 1.96 * behavior_se,
            "ess_share": 1.0,
            "mean_weight": 1.0,
            "max_weight": 1.0,
            "mean_abs_correction": np.nan,
            "mean_correction": np.nan,
        }
    ]
)

estimate_table = pd.concat([observed_baseline, all_estimates], ignore_index=True)
estimate_table["lift_pp"] = 100 * (estimate_table["estimate"] - behavior_value)
estimate_table["relative_lift_pct"] = 100 * (estimate_table["estimate"] / behavior_value - 1)

estimate_table.sort_values(["policy", "estimator", "reward_model"]).head(20)
policy estimator reward_model estimate se ci_95_lower ci_95_upper ess_share mean_weight max_weight mean_abs_correction mean_correction lift_pp relative_lift_pct
0 behavior_random_observed Observed none 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 NaN NaN 0.000000 0.000000
17 ctr_weighted DM lightgbm 0.006021 0.000016 0.005989 0.006052 0.790628 0.996987 2.740842 0.011017 -0.000741 0.104055 20.894635
15 ctr_weighted DM logistic 0.007355 0.000014 0.007328 0.007382 0.790628 0.996987 2.740842 0.012451 -0.002195 0.237523 47.695402
18 ctr_weighted DR lightgbm 0.005280 0.000267 0.004756 0.005803 0.790628 0.996987 2.740842 0.011017 -0.000741 0.029956 6.015257
16 ctr_weighted DR logistic 0.005160 0.000268 0.004634 0.005687 0.790628 0.996987 2.740842 0.012451 -0.002195 0.018048 3.624090
13 ctr_weighted IPS none 0.005172 0.000261 0.004660 0.005685 0.790628 0.996987 2.740842 NaN NaN 0.019244 3.864167
14 ctr_weighted SNIPS none 0.005188 0.000262 0.004675 0.005701 0.790628 0.996987 2.740842 NaN NaN 0.020806 4.178004
23 epsilon_greedy_top_ctr DM lightgbm 0.011740 0.000068 0.011607 0.011873 0.040290 1.001105 29.050000 0.017472 -0.005122 0.676002 135.743313
21 epsilon_greedy_top_ctr DM logistic 0.014628 0.000025 0.014580 0.014676 0.040290 1.001105 29.050000 0.020619 -0.008321 0.964777 193.730292
24 epsilon_greedy_top_ctr DR lightgbm 0.006618 0.001324 0.004023 0.009214 0.040290 1.001105 29.050000 0.017472 -0.005122 0.163824 32.896440
22 epsilon_greedy_top_ctr DR logistic 0.006306 0.001279 0.003799 0.008814 0.040290 1.001105 29.050000 0.020619 -0.008321 0.132637 26.633840
19 epsilon_greedy_top_ctr IPS none 0.006238 0.001267 0.003756 0.008720 0.040290 1.001105 29.050000 NaN NaN 0.125800 25.261044
20 epsilon_greedy_top_ctr SNIPS none 0.006231 0.001261 0.003759 0.008703 0.040290 1.001105 29.050000 NaN NaN 0.125111 25.122784
11 exposure_popularity DM lightgbm 0.004913 0.000013 0.004888 0.004938 0.996399 1.000591 1.132200 0.009779 0.000108 -0.006709 -1.347183
9 exposure_popularity DM logistic 0.005733 0.000011 0.005711 0.005754 0.996399 1.000591 1.132200 0.010630 -0.000756 0.075267 15.113877
12 exposure_popularity DR lightgbm 0.005021 0.000226 0.004578 0.005463 0.996399 1.000591 1.132200 0.009779 0.000108 0.004067 0.816622
10 exposure_popularity DR logistic 0.004977 0.000225 0.004535 0.005418 0.996399 1.000591 1.132200 0.010630 -0.000756 -0.000340 -0.068183
7 exposure_popularity IPS none 0.004971 0.000223 0.004535 0.005407 0.996399 1.000591 1.132200 NaN NaN -0.000904 -0.181598
8 exposure_popularity SNIPS none 0.004968 0.000223 0.004532 0.005404 0.996399 1.000591 1.132200 NaN NaN -0.001198 -0.240522
5 uniform DM lightgbm 0.004943 0.000013 0.004918 0.004968 1.000000 1.000000 1.000000 0.009813 0.000092 -0.003706 -0.744091

Adding the observed behavior-policy baseline gives a familiar reference point. Evaluation policies can now be interpreted as offline alternatives to the policy that generated the held-out log. Note the units: lift_pp is in percentage points of click rate, so the LightGBM DR estimate for epsilon_greedy_top_ctr (0.006618) sits about 0.164 percentage points above the 0.004980 baseline, a relative lift of roughly 33%.

Compare Main Estimators Visually

This plot compares IPS, SNIPS, and LightGBM DR estimates. LightGBM DR is shown because it is the strongest reward model in this notebook, while IPS and SNIPS provide continuity with Notebook 3.

The confidence intervals are approximate normal intervals (estimate plus or minus 1.96 standard errors), so treat them as rough uncertainty bands rather than exact coverage guarantees. The main purpose of the plot is to show whether DR stabilizes policy rankings and whether it agrees directionally with pure importance weighting.

plot_estimates = estimate_table[
    (estimate_table["estimator"].isin(["Observed", "IPS", "SNIPS"]))
    | ((estimate_table["estimator"] == "DR") & (estimate_table["reward_model"] == "lightgbm"))
].copy()
plot_estimates["estimator_label"] = np.where(
    plot_estimates["estimator"] == "DR",
    "DR LightGBM",
    plot_estimates["estimator"],
)
plot_estimates["lower_error"] = plot_estimates["estimate"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["estimate"]

policy_order = plot_estimates["policy"].drop_duplicates().tolist()
estimator_order = ["Observed", "IPS", "SNIPS", "DR LightGBM"]
offsets = {"Observed": 0.0, "IPS": -0.24, "SNIPS": 0.0, "DR LightGBM": 0.24}  # Observed and SNIPS share an offset but never appear for the same policy
colors = {"Observed": "#4C78A8", "IPS": "#F58518", "SNIPS": "#54A24B", "DR LightGBM": "#B279A2"}

fig, ax = plt.subplots(figsize=(12, 5))
for estimator in estimator_order:
    subset = plot_estimates[plot_estimates["estimator_label"] == estimator]
    for _, row in subset.iterrows():
        x_base = policy_order.index(row["policy"])
        x = x_base + offsets[estimator]
        ax.errorbar(
            x=x,
            y=row["estimate"],
            yerr=[[row["lower_error"]], [row["upper_error"]]],
            fmt="o",
            color=colors[estimator],
            ecolor=colors[estimator],
            capsize=4,
            linewidth=1.4,
            markersize=6,
            label=estimator if row["policy"] == subset["policy"].iloc[0] else None,
        )

ax.set_xticks(range(len(policy_order)))
ax.set_xticklabels(policy_order, rotation=25, ha="right")
ax.set_title("Policy Value Estimates: IPS, SNIPS, And DR")
ax.set_xlabel("Policy")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
ax.legend(title="Estimator")
plt.tight_layout()
plt.show()

For the high-overlap policies (uniform, exposure_popularity, ctr_weighted), the four estimators agree closely. For epsilon_greedy_top_ctr, where support is thin, the DR interval is wide but overlaps the IPS and SNIPS intervals, so the estimators tell a directionally consistent story even where precision is poor.

Direct Method Versus DR By Reward Model

This table compares direct method and DR estimates for both reward models.

The direct method is fully model-based. DR adds the residual correction. If the correction is large, it means observed outcomes disagree with the reward model in a way that matters for that policy. Large corrections are not automatically bad, but they deserve interpretation.

dm_dr_table = estimate_table[estimate_table["estimator"].isin(["DM", "DR"])].copy()
dm_dr_table = dm_dr_table[
    [
        "policy",
        "reward_model",
        "estimator",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "mean_abs_correction",
        "mean_correction",
    ]
].sort_values(["policy", "reward_model", "estimator"])

dm_dr_table
policy reward_model estimator estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct mean_abs_correction mean_correction
17 ctr_weighted lightgbm DM 0.006021 0.005989 0.006052 0.104055 20.894635 0.011017 -0.000741
18 ctr_weighted lightgbm DR 0.005280 0.004756 0.005803 0.029956 6.015257 0.011017 -0.000741
15 ctr_weighted logistic DM 0.007355 0.007328 0.007382 0.237523 47.695402 0.012451 -0.002195
16 ctr_weighted logistic DR 0.005160 0.004634 0.005687 0.018048 3.624090 0.012451 -0.002195
23 epsilon_greedy_top_ctr lightgbm DM 0.011740 0.011607 0.011873 0.676002 135.743313 0.017472 -0.005122
24 epsilon_greedy_top_ctr lightgbm DR 0.006618 0.004023 0.009214 0.163824 32.896440 0.017472 -0.005122
21 epsilon_greedy_top_ctr logistic DM 0.014628 0.014580 0.014676 0.964777 193.730292 0.020619 -0.008321
22 epsilon_greedy_top_ctr logistic DR 0.006306 0.003799 0.008814 0.132637 26.633840 0.020619 -0.008321
11 exposure_popularity lightgbm DM 0.004913 0.004888 0.004938 -0.006709 -1.347183 0.009779 0.000108
12 exposure_popularity lightgbm DR 0.005021 0.004578 0.005463 0.004067 0.816622 0.009779 0.000108
9 exposure_popularity logistic DM 0.005733 0.005711 0.005754 0.075267 15.113877 0.010630 -0.000756
10 exposure_popularity logistic DR 0.004977 0.004535 0.005418 -0.000340 -0.068183 0.010630 -0.000756
5 uniform lightgbm DM 0.004943 0.004918 0.004968 -0.003706 -0.744091 0.009813 0.000092
6 uniform lightgbm DR 0.005035 0.004592 0.005477 0.005491 1.102546 0.009813 0.000092
3 uniform logistic DM 0.005786 0.005764 0.005807 0.080553 16.175230 0.010685 -0.000794
4 uniform logistic DR 0.004992 0.004550 0.005434 0.001201 0.241083 0.010685 -0.000794

Comparing DM and DR shows how much the residual correction changes the model-only estimate. Large differences mean the observed logged rewards are correcting the reward model substantially.
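
One sanity check falls directly out of the algebra: because DR averages dm_term + correction row by row, the DR estimate must equal the DM estimate plus mean_correction. The table confirms it, for example 0.005786 + (-0.000794) ≈ 0.004992 for the uniform policy under the logistic model. A quick programmatic check against estimate_table:

dr_rows = estimate_table.query("estimator == 'DR'").set_index(["policy", "reward_model"])
dm_rows = estimate_table.query("estimator == 'DM'").set_index(["policy", "reward_model"])
identity_gap = dr_rows["estimate"] - (dm_rows["estimate"] + dr_rows["mean_correction"])
assert np.allclose(identity_gap, 0.0)  # DR = DM + mean_correction, policy by policy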

Plot Direct Method Versus DR

This plot focuses on the two estimators that use a reward model. It shows how much the DR correction moves the direct-method estimates for each reward model.

The labels combine estimator and reward model, for example DR lightgbm. This avoids overloading Seaborn with multiple grouping variables and makes the comparison easier to read.

dm_dr_plot = dm_dr_table.copy()
dm_dr_plot["estimator_model"] = dm_dr_plot["estimator"] + " " + dm_dr_plot["reward_model"]

fig, ax = plt.subplots(figsize=(12, 5))
sns.pointplot(
    data=dm_dr_plot,
    x="policy",
    y="estimate",
    hue="estimator_model",
    dodge=0.45,
    errorbar=None,
    ax=ax,
)
ax.set_title("Direct Method Versus Doubly Robust Estimates")
ax.set_xlabel("Policy")
ax.set_ylabel("Estimated Click Rate")
ax.tick_params(axis="x", rotation=25)
ax.yaxis.set_major_formatter(lambda y, _: f"{y:.2%}")
ax.legend(title="Estimator / reward model", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()

The plot makes the size of the correction visible: the DM estimates, especially from the logistic model, sit well above their DR counterparts for ctr_weighted and epsilon_greedy_top_ctr. That gap is the residual correction pulling optimistic model-only values back toward the logged outcomes.

DR Correction Diagnostics

This cell summarizes the size of the residual correction in the DR estimator. The correction term is:

(pi_e(A|X) / pi_b(A|X)) * (Y - q_hat(X, A))

A correction near zero on average means the direct method already aligns with observed residuals under that policy. A larger correction means the logged outcomes are materially changing the model-based estimate.
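
In code, the two summary columns are the signed and absolute means of that per-row term. Reusing the illustrative array names from the sketch earlier in the notebook:

corrections = weights * (clicks - q_hat_logged)
mean_correction = corrections.mean()              # signed: systematic model bias under this policy
mean_abs_correction = np.abs(corrections).mean()  # magnitude: how strongly the data pushes on the model

The cell below pulls the same two summaries for every policy and reward model from estimate_table rather than recomputing them.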

correction_diagnostics = (
    estimate_table.query("estimator == 'DR'")
    [["policy", "reward_model", "mean_correction", "mean_abs_correction", "ess_share", "max_weight"]]
    .sort_values("mean_abs_correction", ascending=False)
)

correction_diagnostics
policy reward_model mean_correction mean_abs_correction ess_share max_weight
22 epsilon_greedy_top_ctr logistic -0.008321 0.020619 0.040290 29.050000
24 epsilon_greedy_top_ctr lightgbm -0.005122 0.017472 0.040290 29.050000
16 ctr_weighted logistic -0.002195 0.012451 0.790628 2.740842
18 ctr_weighted lightgbm -0.000741 0.011017 0.790628 2.740842
4 uniform logistic -0.000794 0.010685 1.000000 1.000000
10 exposure_popularity logistic -0.000756 0.010630 0.996399 1.132200
6 uniform lightgbm 0.000092 0.009813 1.000000 1.000000
12 exposure_popularity lightgbm 0.000108 0.009779 0.996399 1.132200

The correction diagnostics reveal how much of the DR estimate comes from importance-weighted residuals. Large or unstable corrections point to support issues, reward-model misspecification, or both.

LightGBM Feature Importance

This cell extracts the top LightGBM feature importances after preprocessing.

Feature importance is not a causal explanation. It only tells us which transformed features the reward model used for prediction. Still, it is useful for debugging and storytelling: a sensible model should use item identity, position, user features, item metadata, or affinity signals rather than random artifacts.

lightgbm_pipeline = reward_models["lightgbm"]
feature_names = lightgbm_pipeline.named_steps["preprocess"].get_feature_names_out()
feature_importances = lightgbm_pipeline.named_steps["model"].feature_importances_

importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": feature_importances})
    .sort_values("importance", ascending=False)
    .head(20)
)

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=importance_df.sort_values("importance"), x="importance", y="feature", ax=ax, color="#4C78A8")
ax.set_title("Top LightGBM Reward Model Features")
ax.set_xlabel("Feature Importance")
ax.set_ylabel("Feature")
plt.tight_layout()
plt.show()

importance_df

feature importance
119 numeric__item_feature_0 350
0 categorical__position_1 236
87 categorical__user_feature_3_9bde591ffaab8d54c4... 174
1 categorical__position_2 173
2 categorical__position_3 171
88 categorical__user_feature_3_c39b0c7dd5d4eb9a18... 139
118 numeric__selected_affinity 136
61 categorical__user_feature_0_1a2b2ad3a7f218a0d7... 122
77 categorical__user_feature_2_9b2d331c329ceb74d3... 120
74 categorical__user_feature_2_719dab53a7560218a9... 119
12 categorical__hour_9 116
4 categorical__hour_1 116
18 categorical__hour_15 114
15 categorical__hour_12 112
71 categorical__user_feature_2_2723d2eb8bba04e036... 104
65 categorical__user_feature_1_03a5648a76832f83c8... 103
17 categorical__hour_14 100
3 categorical__hour_0 96
82 categorical__user_feature_3_06128286bcc64b6a4b... 93
90 categorical__user_feature_3_f97571b9c14a786aab... 92

Feature importance helps explain what the reward model uses to predict clicks. These importances support model interpretation, but they should not be read as causal effects of the features themselves.
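
One caveat when reading this chart: LGBMClassifier's feature_importances_ reports split counts by default (how often a feature is used), which tends to flatter high-cardinality one-hot blocks. Gain-based importance, available from the underlying booster, often ranks features differently and is a cheap cross-check:

# Gain-based importance as a cross-check; rankings can differ from split counts.
booster = lightgbm_pipeline.named_steps["model"].booster_
gain_df = (
    pd.DataFrame({"feature": feature_names, "gain": booster.feature_importance(importance_type="gain")})
    .sort_values("gain", ascending=False)
    .head(20)
)
gain_df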

Weight And Model Diagnostics Together

This cell brings together the two main sources of uncertainty:

  • importance-weight stability
  • reward-model quality

A good DR estimate should have reasonable support and a reward model that is at least directionally predictive. Weak weights plus a weak reward model would make any offline policy conclusion fragile.

lightgbm_metric_row = reward_model_metrics.query("reward_model == 'lightgbm'").iloc[0]
combined_diagnostics = (
    estimate_table.query("estimator == 'DR' and reward_model == 'lightgbm'")
    [["policy", "estimate", "lift_pp", "ess_share", "mean_weight", "max_weight", "mean_abs_correction"]]
    .assign(
        reward_model_auc=lightgbm_metric_row["auc"],
        reward_model_log_loss=lightgbm_metric_row["log_loss"],
        reward_model_brier=lightgbm_metric_row["brier_score"],
    )
    .sort_values("estimate", ascending=False)
)

combined_diagnostics
policy estimate lift_pp ess_share mean_weight max_weight mean_abs_correction reward_model_auc reward_model_log_loss reward_model_brier
24 epsilon_greedy_top_ctr 0.006618 0.163824 0.040290 1.001105 29.050000 0.017472 0.533007 0.034352 0.005104
18 ctr_weighted 0.005280 0.029956 0.790628 0.996987 2.740842 0.011017 0.533007 0.034352 0.005104
6 uniform 0.005035 0.005491 1.000000 1.000000 1.000000 0.009813 0.533007 0.034352 0.005104
12 exposure_popularity 0.005021 0.004067 0.996399 1.000591 1.132200 0.009779 0.533007 0.034352 0.005104

Combining weight and model diagnostics gives a fuller risk picture than policy value alone. A policy should look good not only on estimated reward, but also on support, ESS, and model quality.
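
ess_share also supports a rough precision heuristic: under importance weighting, the standard error scales roughly like 1 / sqrt(n * ess_share), so epsilon_greedy_top_ctr's ess_share of about 0.04 implies an interval roughly five times wider than a full-support policy's, consistent with the SE columns earlier in the notebook. This is a heuristic, not an exact variance formula:

# Approximate SE inflation relative to a full-support policy (ess_share = 1).
combined_diagnostics.assign(
    se_inflation_approx=lambda df: 1.0 / np.sqrt(df["ess_share"])
)[["policy", "ess_share", "se_inflation_approx"]]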

Main Notebook Result Table

This final result table is the compact version to reference in a portfolio writeup.

It focuses on LightGBM DR because that is the strongest reward-model version in this notebook, while still retaining IPS and SNIPS as benchmarks from the previous notebook. The best policy should be interpreted alongside ESS, confidence intervals, and the observational limitations of logged bandit data.

main_result_table = estimate_table[
    (estimate_table["estimator"].isin(["Observed", "IPS", "SNIPS"]))
    | ((estimate_table["estimator"] == "DR") & (estimate_table["reward_model"] == "lightgbm"))
].copy()
main_result_table = main_result_table[
    [
        "policy",
        "estimator",
        "reward_model",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "ess_share",
        "mean_weight",
        "max_weight",
    ]
].sort_values(["estimate", "ess_share"], ascending=[False, False])

main_result_table
policy estimator reward_model estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct ess_share mean_weight max_weight
24 epsilon_greedy_top_ctr DR lightgbm 0.006618 0.004023 0.009214 0.163824 32.896440 0.040290 1.001105 29.050000
19 epsilon_greedy_top_ctr IPS none 0.006238 0.003756 0.008720 0.125800 25.261044 0.040290 1.001105 29.050000
20 epsilon_greedy_top_ctr SNIPS none 0.006231 0.003759 0.008703 0.125111 25.122784 0.040290 1.001105 29.050000
18 ctr_weighted DR lightgbm 0.005280 0.004756 0.005803 0.029956 6.015257 0.790628 0.996987 2.740842
14 ctr_weighted SNIPS none 0.005188 0.004675 0.005701 0.020806 4.178004 0.790628 0.996987 2.740842
13 ctr_weighted IPS none 0.005172 0.004660 0.005685 0.019244 3.864167 0.790628 0.996987 2.740842
6 uniform DR lightgbm 0.005035 0.004592 0.005477 0.005491 1.102546 1.000000 1.000000 1.000000
12 exposure_popularity DR lightgbm 0.005021 0.004578 0.005463 0.004067 0.816622 0.996399 1.000591 1.132200
1 uniform IPS none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
0 behavior_random_observed Observed none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
2 uniform SNIPS none 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000
7 exposure_popularity IPS none 0.004971 0.004535 0.005407 -0.000904 -0.181598 0.996399 1.000591 1.132200
8 exposure_popularity SNIPS none 0.004968 0.004532 0.005404 -0.001198 -0.240522 0.996399 1.000591 1.132200

Sorted by estimated click rate, epsilon_greedy_top_ctr leads across DR, IPS, and SNIPS, but its ess_share of roughly 0.04 means that ranking rests on a small effective sample. ctr_weighted offers a smaller estimated lift with far better support, a trade-off the sensitivity work in the next notebook should interrogate.

Notebook 4 Takeaways

This notebook introduced doubly robust OPE for the Open Bandit project.

The key ideas are:

  • IPS and SNIPS use propensities and observed rewards directly.
  • The direct method uses a reward model to predict outcomes for every candidate action.
  • Doubly robust OPE combines the direct method with an IPS-style residual correction (restated as a compact sketch after this list).
  • Reward-model diagnostics matter because DR is only useful if the model contains real signal.
  • Weight diagnostics still matter because the residual correction uses importance weights.
  • Comparing logistic DR and LightGBM DR is a useful sensitivity check.
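
For reference, the whole DR recipe fits in one self-contained function. This is a compact sketch with illustrative argument names, not the notebook's actual helper:

import numpy as np

def doubly_robust_value(pi_e_matrix, pi_e_logged, propensities, clicks, q_hat_matrix, q_hat_logged):
    # pi_e_matrix:  (n, n_actions) evaluation-policy probabilities for every action
    # pi_e_logged:  (n,) evaluation-policy probability of the logged action
    # propensities: (n,) behavior-policy probability of the logged action
    # clicks:       (n,) observed rewards
    # q_hat_matrix: (n, n_actions) reward-model predictions for every action
    # q_hat_logged: (n,) reward-model prediction for the logged action
    weights = pi_e_logged / propensities
    dm_term = (pi_e_matrix * q_hat_matrix).sum(axis=1)   # direct-method component
    correction = weights * (clicks - q_hat_logged)       # importance-weighted residuals
    per_row = dm_term + correction
    estimate = per_row.mean()
    se = per_row.std(ddof=1) / np.sqrt(len(per_row))     # normal-approximation standard error
    return estimate, estimate - 1.96 * se, estimate + 1.96 * se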

The next notebook should focus on policy comparison and sensitivity: clipping thresholds, alternate reward models, train/evaluation split sensitivity, and a polished recommendation about which policy is most credible offline.