07. Off-Policy Evaluation

Off-policy evaluation (OPE) is the problem of estimating how a new decision policy would perform using data collected under an older policy.

This is one of the most industry-relevant uses of causal inference. Many companies have logs from recommenders, marketing systems, pricing engines, support-routing tools, ranking systems, sales playbooks, and human decision processes. A new policy may look promising, but deploying it directly can be expensive or risky. OPE asks:

Can we estimate the value of the new policy before exposing customers, users, patients, sellers, or accounts to it?

The answer is “sometimes.” OPE can be very powerful when the logging policy is known, stochastic, and has enough overlap with the target policy. It can be dangerously misleading when those conditions fail.

Learning Goals

By the end of this notebook, you should be able to:

Define logging policies, target policies, contextual bandit feedback, and policy value.
Explain why naive offline evaluation is biased under partial feedback.
Implement direct method, inverse propensity scoring, self-normalized IPS, clipped IPS, and doubly robust OPE.
Diagnose support and overlap problems through importance weights.
Compare estimator bias and variance against simulation truth.
Explain why logged propensities are a production requirement for reliable OPE.
Translate OPE results into a cautious industry recommendation.

Dataset and Experiment Setup

This notebook simulates logged interventions from an existing operational policy. The unit is an account or case. For each unit, the data includes context, the action actually taken, the logging propensity, and the observed reward. Actions might represent different support levels, sales plays, retention offers, or escalation paths. The goal is to evaluate a new policy using data collected under an old policy.

This setting must be simulated because off-policy evaluation needs the true value of candidate policies for teaching. In real logs, we observe only the reward for the action that was actually taken. The counterfactual reward under a different action is hidden. The simulation includes nonuniform action assignment, propensities, reward heterogeneity, and limited support so IPS, self-normalized IPS, doubly robust estimators, and effective sample size diagnostics can be interpreted against known truth. That makes the example closer to production decision systems than a balanced randomized comparison.

import warnings

import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import KFold, train_test_split

warnings.filterwarnings("ignore")

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.float_format", "{:.3f}".format)

# Define reusable helpers so the later analytical cells stay focused on the causal argument.
ACTION_LABELS = {
    0: "No action",
    1: "Email education",
    2: "Discount offer",
    3: "Human success call",
}
N_ACTIONS = len(ACTION_LABELS)


def sigmoid(x):
    """Compute sigmoid for the off policy evaluation example.
    
    Inputs:
    - `x`: logit-scale numeric input converted into a probability.
    Returns: a NumPy array or scalar of logistic probabilities."""
    return 1 / (1 + np.exp(-x))


def row_softmax(utilities):
    """Compute row softmax for the off policy evaluation example.
    
    Inputs:
    - `utilities`: row-wise action utility scores before softmax normalization.
    Returns: one probability vector per row of the score matrix."""
    shifted = utilities - utilities.max(axis=1, keepdims=True)
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum(axis=1, keepdims=True)


def one_hot(actions, n_actions=N_ACTIONS):
    """Compute one hot for the off policy evaluation example.
    
    Inputs:
    - `actions`: integer action labels observed or assigned in the policy example.
    - `n_actions`: number of possible actions in the off-policy evaluation action set.
    Returns: a binary indicator matrix for the supplied action labels."""
    mat = np.zeros((len(actions), n_actions))
    mat[np.arange(len(actions)), np.asarray(actions, dtype=int)] = 1.0
    return mat


def make_dag(edges, title=None, node_colors=None, rankdir="LR"):
    """Construct the dag object.
    
    Inputs:
    - `edges`: directed or undirected graph edges represented as node pairs.
    - `title`: plot or graph title displayed above the figure.
    - `node_colors`: optional mapping from graph nodes to fill colors.
    - `rankdir`: Graphviz layout direction, such as left-to-right or top-to-bottom.
    Returns: a Graphviz diagram for the causal structure in the section."""
    graph = graphviz.Digraph(format="svg")
    graph.attr(rankdir=rankdir, bgcolor="white", pad="0.2")
    graph.attr("node", shape="box", style="rounded,filled", color="#334155", fontname="Helvetica")
    graph.attr("edge", color="#475569", arrowsize="0.8")

    node_colors = node_colors or {}
    nodes = sorted({node for edge in edges for node in edge[:2]})
    for node in nodes:
        graph.node(node, fillcolor=node_colors.get(node, "#eef2ff"))

    for edge in edges:
        if len(edge) == 2:
            graph.edge(edge[0], edge[1])
        else:
            graph.edge(edge[0], edge[1], label=edge[2])

    if title:
        graph.attr(label=title, labelloc="t", fontsize="18", fontname="Helvetica-Bold")
    return graph


def effective_sample_size(weights):
    """Compute effective sample size for the off policy evaluation example.
    
    Inputs:
    - `weights`: nonnegative weights used for weighted means, variances, or effective sample size.
    Returns: an effect-estimate summary for the causal comparison in the section."""
    weights = np.asarray(weights)
    if np.sum(weights**2) == 0:
        return 0.0
    return (np.sum(weights) ** 2) / np.sum(weights**2)

1. The OPE Problem

A logged decision dataset contains triples:

\[ (X_i, A_i, R_i) \]

where:

\(X_i\) is the context available at decision time.
\(A_i\) is the action chosen by the logging policy.
\(R_i\) is the observed reward for the chosen action.

The key difficulty is partial feedback. We observe the reward for the action actually taken, but not the rewards that would have occurred under the other actions.

Let the historical logging policy be:

\[ \mu(a \mid x) \]

and let the new target policy be:

\[ \pi(a \mid x) \]

The policy value is:

\[ V(\pi) = E\left[\sum_{a} \pi(a \mid X)Y(a)\right] \]

OPE estimates \(V(\pi)\) using data collected under \(\mu\), without deploying \(\pi\).

# Organize the calculations for this section into readable intermediate steps.
make_dag(
    edges=[
        ("ContextX", "LoggingPolicy"),
        ("LoggingPolicy", "LoggedAction"),
        ("ContextX", "PotentialRewards"),
        ("LoggedAction", "ObservedReward"),
        ("PotentialRewards", "ObservedReward"),
        ("ContextX", "TargetPolicy"),
        ("TargetPolicy", "CounterfactualAction"),
        ("LoggedData", "OPEEstimator"),
        ("TargetPolicy", "OPEEstimator"),
        ("OPEEstimator", "EstimatedPolicyValue"),
    ],
    title="Off-policy evaluation estimates a target policy from logged decisions",
    node_colors={
        "ContextX": "#dbeafe",
        "LoggingPolicy": "#fef3c7",
        "LoggedAction": "#fee2e2",
        "PotentialRewards": "#dcfce7",
        "ObservedReward": "#f1f5f9",
        "TargetPolicy": "#ede9fe",
        "CounterfactualAction": "#cffafe",
        "LoggedData": "#e0f2fe",
        "OPEEstimator": "#fce7f3",
        "EstimatedPolicyValue": "#ecfccb",
    },
)

Dudík et al. (2014) present doubly robust policy evaluation and optimization for contextual bandits. Horvitz and Thompson (1952) is the classical origin of inverse-probability weighting. Modern OPE work studies the bias-variance tradeoff among direct modeling, importance weighting, self-normalization, clipping, and doubly robust estimators (Su et al., 2019; Zhan et al., 2021; Saito & Joachims, 2022).

2. Identification Conditions

For the basic contextual bandit OPE estimators in this notebook, the important conditions are:

Consistency: the observed reward equals the potential reward for the action actually taken.
Conditional exchangeability: after conditioning on logged context \(X\), action assignment is as good as random according to the logging policy.
Positivity or overlap. If the target policy may choose action \(a\) at context \(x\), then the logging policy must have nonzero probability of choosing that action.

\[ \pi(a \mid x) > 0 \Rightarrow \mu(a \mid x) > 0 \]

Correct logging probabilities: the analyst must know or accurately estimate \(\mu(A_i \mid X_i)\).

OPE is not a shortcut around identification. It is a way to use randomized or conditionally randomized historical logs efficiently.

3. Running Example: Offline Evaluation of Account Interventions

Imagine a SaaS company deciding among four account-level interventions:

No action.
Send educational lifecycle email.
Offer a limited discount.
Assign a human customer-success call.

The company has historical logs from a stochastic decision policy. That policy used account context and deliberately kept some exploration, so every action had at least some chance of being chosen. Now the team wants to evaluate new targeting policies offline before running a live experiment.

In this simulation, we observe the true potential reward functions only because we created the data. Real OPE projects do not have that luxury; simulation truth is used here to check estimator behavior.

# Generate the teaching data and keep the causal quantities needed for diagnostics.
def simulate_account_intervention_logs(n=12000, seed=5707):
    """Simulate the account intervention logs data-generating process.
    
    Inputs:
    - `n`: number of simulated units or records to generate for the teaching example.
    - `seed`: integer random seed that makes the simulation or model split reproducible.
    Returns: a pandas DataFrame with generated covariates, treatment or policy variables, outcomes, and teaching diagnostics for this scenario."""
    rng = np.random.default_rng(seed)

    churn_risk = rng.beta(2.1, 2.4, n)
    product_health = rng.beta(2.5, 2.0, n)
    email_engagement = rng.beta(1.7, 2.4, n)
    price_sensitivity = rng.beta(2.0, 2.6, n)
    usage_depth = rng.normal(0, 1, n)
    log_mrr = rng.normal(4.45, 0.65, n)
    mrr_z = (log_mrr - log_mrr.mean()) / log_mrr.std()
    enterprise = rng.binomial(1, sigmoid(-0.35 + 0.9 * mrr_z + 0.45 * churn_risk), n)
    strategic = rng.binomial(1, sigmoid(-1.4 + 1.0 * enterprise + 0.55 * mrr_z), n)
    onboarding_gap = rng.beta(1.8, 3.0, n)
    support_load = rng.poisson(np.clip(0.3 + 3.2 * churn_risk - 0.8 * product_health + 0.7 * enterprise, 0.05, None))
    region_west = rng.binomial(1, 0.32, n)
    region_emea = rng.binomial(1, 0.27, n) * (1 - region_west)

    baseline = (
        58
        + 11.0 * mrr_z
        + 18.0 * product_health
        - 24.0 * churn_risk
        + 5.5 * enterprise
        + 3.0 * strategic
        + 3.2 * usage_depth
        - 1.6 * support_load
        - 7.0 * onboarding_gap
        + 2.4 * region_west
        - 1.4 * region_emea
        + 3.0 * np.sin(usage_depth)
    )

    effect_no_action = np.zeros(n)
    effect_email = (
        1.5
        + 10.0 * onboarding_gap
        + 8.0 * email_engagement * (1 - product_health)
        + 3.5 * (usage_depth < -0.5)
        - 2.2 * enterprise
        - 1.0 * np.maximum(support_load - 4, 0)
    )
    effect_discount = (
        -1.0
        + 19.0 * price_sensitivity
        + 8.0 * churn_risk
        - 6.0 * product_health
        - 4.5 * enterprise
        - 2.0 * strategic
    )
    effect_call = (
        -7.0
        + 23.0 * enterprise * churn_risk
        + 9.0 * strategic
        + 12.0 * onboarding_gap
        + 4.0 * (support_load <= 3)
        - 6.0 * (product_health > 0.72)
        + 2.0 * region_emea
    )

    mu = np.column_stack(
        [
            baseline + effect_no_action,
            baseline + effect_email,
            baseline + effect_discount,
            baseline + effect_call,
        ]
    )

    logging_utilities = np.column_stack(
        [
            0.45 + 0.75 * product_health - 0.80 * churn_risk,
            0.10 + 1.35 * onboarding_gap + 0.75 * email_engagement - 0.30 * enterprise,
            -0.15 + 1.55 * price_sensitivity + 0.85 * churn_risk - 0.40 * strategic,
            -0.75 + 1.15 * enterprise + 0.85 * strategic + 1.15 * churn_risk + 0.18 * support_load,
        ]
    )
    logging_probs = row_softmax(logging_utilities)
    exploration = 0.08
    logging_probs = (1 - exploration) * logging_probs + exploration / N_ACTIONS

    actions = np.array([rng.choice(N_ACTIONS, p=probs) for probs in logging_probs])
    reward = mu[np.arange(n), actions] + rng.normal(0, 7.0, n)

    df = pd.DataFrame(
        {
            "account_id": np.arange(1, n + 1),
            "churn_risk": churn_risk,
            "product_health": product_health,
            "email_engagement": email_engagement,
            "price_sensitivity": price_sensitivity,
            "usage_depth": usage_depth,
            "log_mrr": log_mrr,
            "mrr_z": mrr_z,
            "enterprise": enterprise,
            "strategic": strategic,
            "onboarding_gap": onboarding_gap,
            "support_load": support_load,
            "region_west": region_west,
            "region_emea": region_emea,
            "action": actions,
            "action_label": [ACTION_LABELS[a] for a in actions],
            "reward": reward,
            "logging_prob_chosen": logging_probs[np.arange(n), actions],
        }
    )
    for action_id, label in ACTION_LABELS.items():
        safe_label = label.lower().replace(" ", "_")
        df[f"mu_{action_id}"] = mu[:, action_id]
        df[f"logging_prob_{action_id}"] = logging_probs[:, action_id]
        df[f"incremental_vs_no_action_{safe_label}"] = mu[:, action_id] - mu[:, 0]

    return df


df = simulate_account_intervention_logs()
df.head()

	account_id	churn_risk	product_health	email_engagement	price_sensitivity	usage_depth	log_mrr	mrr_z	enterprise	strategic	...	mu_1	logging_prob_1	incremental_vs_no_action_email_education	mu_2	logging_prob_2	incremental_vs_no_action_discount_offer	mu_3	logging_prob_3	incremental_vs_no_action_human_success_call
0	1	0.420	0.692	0.537	0.129	-0.286	5.600	1.785	0	0	...	81.913	0.420	8.268	74.307	0.194	0.662	77.178	0.147	3.533
1	2	0.390	0.796	0.509	0.294	0.424	4.775	0.512	0	0	...	72.310	0.282	3.645	71.603	0.276	2.938	61.241	0.140	-7.424
2	3	0.780	0.325	0.352	0.485	-0.314	4.385	-0.090	0	0	...	37.010	0.193	3.767	45.753	0.375	12.509	27.883	0.305	-5.360
3	4	0.313	0.369	0.850	0.663	0.553	4.939	0.765	1	1	...	76.399	0.210	8.984	72.796	0.145	5.381	89.090	0.529	21.675
4	5	0.009	0.611	0.781	0.194	-1.266	3.769	-1.039	0	0	...	61.331	0.423	10.964	49.460	0.167	-0.906	51.606	0.080	1.240

5 rows × 30 columns

At this point, ask what the result lets us defend. Logged data can evaluate a new policy only where the old policy created enough support.

# Assemble the summary table used to interpret the causal comparison.
feature_cols = [
    "churn_risk",
    "product_health",
    "email_engagement",
    "price_sensitivity",
    "usage_depth",
    "mrr_z",
    "enterprise",
    "strategic",
    "onboarding_gap",
    "support_load",
    "region_west",
    "region_emea",
]
mu_cols = [f"mu_{a}" for a in range(N_ACTIONS)]
logging_prob_cols = [f"logging_prob_{a}" for a in range(N_ACTIONS)]

summary = pd.DataFrame(
    {
        "quantity": [
            "Logged decisions",
            "Average observed reward",
            "Average chosen-action propensity",
            "Minimum chosen-action propensity",
            "Action with highest average true reward",
            "Average value under logging policy",
        ],
        "value": [
            len(df),
            df["reward"].mean(),
            df["logging_prob_chosen"].mean(),
            df["logging_prob_chosen"].min(),
            ACTION_LABELS[np.argmax(df[mu_cols].mean().to_numpy())],
            np.mean((df[logging_prob_cols].to_numpy() * df[mu_cols].to_numpy()).sum(axis=1)),
        ],
    }
)
display(summary)

	quantity	value
0	Logged decisions	12000
1	Average observed reward	62.761
2	Average chosen-action propensity	0.316
3	Minimum chosen-action propensity	0.045
4	Action with highest average true reward	Human success call
5	Average value under logging policy	62.824

In an applied workflow, the main point is direct. The behavior policy is part of the evidence, so weights and support deserve as much attention as the value estimate.

fig, axes = plt.subplots(1, 3, figsize=(15, 4.2))

sns.countplot(data=df, x="action_label", order=list(ACTION_LABELS.values()), color="#2563eb", ax=axes[0])
axes[0].set_title("Logged action distribution")
axes[0].set_xlabel("")
axes[0].tick_params(axis="x", rotation=35)

sns.histplot(df["logging_prob_chosen"], bins=35, kde=True, color="#0f766e", ax=axes[1])
axes[1].set_title("Chosen-action propensities")
axes[1].set_xlabel("Logging probability for chosen action")

true_action_means = df[mu_cols].mean().rename(index={f"mu_{a}": ACTION_LABELS[a] for a in range(N_ACTIONS)})
axes[2].bar(true_action_means.index, true_action_means.values, color="#7c3aed", alpha=0.82)
axes[2].set_title("Average true reward by action")
axes[2].set_ylabel("Expected reward")
axes[2].tick_params(axis="x", rotation=35)

plt.tight_layout()
plt.show()

This setup is a contextual bandit:

The system sees context.
It chooses one action.
It observes only that action’s reward.
The rewards for the unchosen actions are missing.

The logging policy is stochastic and stores propensities. That is the feature that makes reliable OPE possible. If the logs only stored the chosen action and reward, we would have a much weaker evaluation problem.

4. Candidate Target Policies

We will evaluate several target policies:

Logging policy: the policy that generated the data.
Uniform random policy: a sanity-check policy.
Business rule: a transparent hand-built rule.
Model argmax policy: a deterministic policy that chooses the action with the highest predicted reward.
Conservative model policy: mostly follows the model argmax but keeps 15% uniform exploration.
Aggressive call-heavy policy: a riskier policy that uses many human calls.

The model-based policies are learned on a policy-design split. Their values are evaluated on a separate evaluation split.

# Define reusable helpers so the later analytical cells stay focused on the causal argument.
design_idx, eval_idx = train_test_split(
    df.index,
    test_size=0.45,
    random_state=5710,
    stratify=df["action"],
)
design = df.loc[design_idx].copy()
eval_logs = df.loc[eval_idx].copy()


def fit_action_reward_models(frame, feature_cols):
    """Fit the action reward models model or estimator.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `feature_cols`: list of pre-treatment covariate column names used as model features or effect modifiers.
    Returns: fitted model objects or a structured result used by the estimator in this notebook."""
    models = {}
    base_model = HistGradientBoostingRegressor(
        max_iter=180,
        learning_rate=0.045,
        min_samples_leaf=45,
        l2_regularization=0.03,
        random_state=5711,
    )
    for action_id in range(N_ACTIONS):
        action_frame = frame[frame["action"] == action_id]
        model = clone(base_model)
        model.fit(action_frame[feature_cols], action_frame["reward"])
        models[action_id] = model
    return models


def predict_action_rewards(models, frame, feature_cols):
    """Generate action rewards predictions.
    
    Inputs:
    - `models`: collection of fitted action-specific or treatment-specific models.
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `feature_cols`: list of pre-treatment covariate column names used as model features or effect modifiers.
    Returns: predicted effects, rewards, or outcomes for the rows passed into the function."""
    preds = np.zeros((len(frame), N_ACTIONS))
    for action_id, model in models.items():
        preds[:, action_id] = model.predict(frame[feature_cols])
    return preds


policy_design_models = fit_action_reward_models(design, feature_cols)
eval_pred_rewards_for_policy = predict_action_rewards(policy_design_models, eval_logs, feature_cols)
model_argmax_actions = eval_pred_rewards_for_policy.argmax(axis=1)


def business_rule_actions(frame):
    """Apply the interpretable business-rule policy to each account.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    Returns: an integer action array assigning each row to no action, email, discount, or call."""
    action = np.zeros(len(frame), dtype=int)
    call = (frame["enterprise"].to_numpy() == 1) & (frame["churn_risk"].to_numpy() > 0.58)
    discount = (~call) & (frame["price_sensitivity"].to_numpy() > 0.67) & (frame["churn_risk"].to_numpy() > 0.42)
    email = (~call) & (~discount) & (
        (frame["onboarding_gap"].to_numpy() > 0.55) | (frame["email_engagement"].to_numpy() > 0.65)
    )
    action[email] = 1
    action[discount] = 2
    action[call] = 3
    return action


def aggressive_call_actions(frame):
    """Apply the aggressive call policy used as an off-policy benchmark.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    Returns: an integer action array that escalates more accounts to the call intervention."""
    action = np.ones(len(frame), dtype=int)
    discount = frame["price_sensitivity"].to_numpy() > 0.55
    call = (frame["enterprise"].to_numpy() == 1) | (frame["churn_risk"].to_numpy() > 0.48)
    action[discount] = 2
    action[call] = 3
    return action


def conservative_probs(actions, epsilon=0.15):
    """Convert deterministic actions into an epsilon-smoothed target policy.
    
    Inputs:
    - `actions`: integer action labels observed or assigned in the policy example.
    - `epsilon`: small exploration probability used to keep every action possible under the policy.
    Returns: a probability matrix with most mass on the chosen action and small mass on alternatives."""
    return (1 - epsilon) * one_hot(actions) + epsilon / N_ACTIONS


target_policies = {
    "Logging policy": eval_logs[logging_prob_cols].to_numpy(),
    "Uniform random": np.ones((len(eval_logs), N_ACTIONS)) / N_ACTIONS,
    "Business rule": one_hot(business_rule_actions(eval_logs)),
    "Model argmax": one_hot(model_argmax_actions),
    "Conservative model": conservative_probs(model_argmax_actions, epsilon=0.15),
    "Aggressive call-heavy": one_hot(aggressive_call_actions(eval_logs)),
}

policy_action_mix = []
for policy_name, probs in target_policies.items():
    action_rates = probs.mean(axis=0)
    row = {"policy": policy_name}
    for action_id, label in ACTION_LABELS.items():
        row[label] = action_rates[action_id]
    policy_action_mix.append(row)

display(pd.DataFrame(policy_action_mix))

	policy	No action	Email education	Discount offer	Human success call
0	Logging policy	0.189	0.248	0.253	0.310
1	Uniform random	0.250	0.250	0.250	0.250
2	Business rule	0.503	0.276	0.065	0.156
3	Model argmax	0.008	0.291	0.242	0.459
4	Conservative model	0.044	0.285	0.244	0.427
5	Aggressive call-heavy	0.000	0.204	0.092	0.704

The action mix is an immediate deployment diagnostic. A policy that shifts heavily toward an expensive or operationally scarce action may have a high estimated reward but still be infeasible.

It is also an OPE diagnostic. The more different the target policy is from the logging policy, the harder offline evaluation becomes.

5. Ground Truth Value in Simulation

In real life, policy value is unknown. Here we can compute it because the simulation contains all potential reward means:

\[ V(\pi) = \frac{1}{n} \sum_{i} \sum_{a} \pi(a \mid X_i)\mu_a(X_i) \]

where \(\mu_a(X_i)=E[Y_i(a)\mid X_i]\).

This simulation truth lets us separate two questions:

What does each OPE estimator say?
How close is that estimate to the value we know from the data-generating process?

def true_policy_value(frame, target_probs):
    """Compute the true value of a target policy using simulated potential rewards.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `target_probs`: target-policy action probabilities for each row in the logged dataset.
    Returns: a scalar mean reward under the target action probabilities."""
    return np.mean(np.sum(target_probs * frame[mu_cols].to_numpy(), axis=1))


truth_table = pd.DataFrame(
    [
        {
            "policy": name,
            "true_value": true_policy_value(eval_logs, probs),
            "true_gain_vs_logging": true_policy_value(eval_logs, probs)
            - true_policy_value(eval_logs, target_policies["Logging policy"]),
        }
        for name, probs in target_policies.items()
    ]
).sort_values("true_value", ascending=False)

display(truth_table)

	policy	true_value	true_gain_vs_logging
3	Model argmax	67.325	4.533
4	Conservative model	66.276	3.484
5	Aggressive call-heavy	65.291	2.499
0	Logging policy	62.792	0.000
2	Business rule	61.850	-0.942
1	Uniform random	60.333	-2.459

6. Why Naive Offline Evaluation Fails

A naive matched-outcome estimate keeps only rows where the logged action matches the target policy action and averages their observed rewards.

For deterministic target policies:

\[ \hat{V}_{matched} = \frac{\sum_{i} 1\{A_i = \pi(X_i)\}R_i} {\sum_{i} 1\{A_i = \pi(X_i)\}} \]

This can be badly biased when the logging policy is context-dependent. The matched rows are not a random sample of all contexts. They are the contexts where the logging policy happened to take the same action as the target policy.

Proper OPE accounts for the probability of the observed action under the logging policy.

def deterministic_actions_from_probs(probs):
    """Convert a target-policy probability matrix into chosen actions.
    
    Inputs:
    - `probs`: row-wise action probability matrix used to choose deterministic policy actions.
    Returns: an integer array selecting the highest-probability action for each row."""
    is_deterministic = np.all((np.isclose(probs, 0) | np.isclose(probs, 1)), axis=1)
    if not np.all(is_deterministic):
        return None
    return probs.argmax(axis=1)


naive_rows = []
for policy_name, probs in target_policies.items():
    target_actions = deterministic_actions_from_probs(probs)
    if target_actions is None:
        continue
    matched = eval_logs["action"].to_numpy() == target_actions
    naive_rows.append(
        {
            "policy": policy_name,
            "matched_share": matched.mean(),
            "naive_matched_reward": eval_logs.loc[matched, "reward"].mean(),
            "true_value": true_policy_value(eval_logs, probs),
            "naive_bias": eval_logs.loc[matched, "reward"].mean() - true_policy_value(eval_logs, probs),
        }
    )

display(pd.DataFrame(naive_rows).sort_values("naive_bias"))

	policy	matched_share	naive_matched_reward	true_value	naive_bias
0	Business rule	0.323	63.967	61.850	2.117
1	Model argmax	0.402	69.657	67.325	2.332
2	Aggressive call-heavy	0.377	69.453	65.291	4.162

The matched-outcome estimator is tempting because it feels like “look at cases where we did the same thing.” But if the old system took that action only for certain kinds of accounts, the matched subset does not represent the population the new policy would touch.

7. OPE Estimators

Let \(p_i = \mu(A_i \mid X_i)\) be the logged propensity of the action actually taken.

Let \(q_i = \pi(A_i \mid X_i)\) be the target policy probability of that same action.

The importance weight is:

\[ w_i = \frac{q_i}{p_i} \]

Inverse propensity scoring estimates:

\[ \hat{V}_{IPS} = \frac{1}{n}\sum_{i} w_i R_i \]

Self-normalized IPS estimates:

\[ \hat{V}_{SNIPS} = \frac{\sum_{i} w_i R_i}{\sum_{i} w_i} \]

The direct method estimates a reward model \(\hat{\mu}_a(x)\) and plugs it in:

\[ \hat{V}_{DM} = \frac{1}{n}\sum_{i}\sum_{a} \pi(a \mid X_i)\hat{\mu}_a(X_i) \]

The doubly robust estimator combines both:

\[ \hat{V}_{DR} = \frac{1}{n}\sum_{i} \left[ \sum_{a} \pi(a \mid X_i)\hat{\mu}_a(X_i) + \frac{\pi(A_i \mid X_i)}{\mu(A_i \mid X_i)} \{R_i-\hat{\mu}_{A_i}(X_i)\} \right] \]

DR is attractive because the reward model supplies baseline predictions while the weighted residual corrects the model using logged outcomes.

# Fit the models for this section and assemble the estimates used in the discussion.
def crossfit_reward_matrix(frame, feature_cols, n_splits=5, seed=5712):
    """Build cross-fitted reward matrix quantities.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `feature_cols`: list of pre-treatment covariate column names used as model features or effect modifiers.
    - `n_splits`: number of folds used for cross-fitting so rows are predicted out of sample.
    - `seed`: integer random seed that makes the simulation or model split reproducible.
    Returns: out-of-fold predictions or scores built without using each row to predict itself."""
    n = len(frame)
    mu_hat = np.zeros((n, N_ACTIONS))
    fold_id = np.zeros(n, dtype=int)
    base_model = HistGradientBoostingRegressor(
        max_iter=180,
        learning_rate=0.045,
        min_samples_leaf=40,
        l2_regularization=0.025,
        random_state=seed,
    )

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (train_idx, test_idx) in enumerate(kfold.split(frame), start=1):
        train = frame.iloc[train_idx]
        test = frame.iloc[test_idx]
        for action_id in range(N_ACTIONS):
            action_train = train[train["action"] == action_id]
            model = clone(base_model)
            model.fit(action_train[feature_cols], action_train["reward"])
            mu_hat[test_idx, action_id] = model.predict(test[feature_cols])
        fold_id[test_idx] = fold
    return mu_hat, fold_id


mu_hat_eval, fold_id_eval = crossfit_reward_matrix(eval_logs, feature_cols)
eval_logs = eval_logs.copy()
eval_logs["fold"] = fold_id_eval

chosen_actions = eval_logs["action"].to_numpy()
chosen_mu_hat = mu_hat_eval[np.arange(len(eval_logs)), chosen_actions]
reward_rmse_chosen = np.sqrt(mean_squared_error(eval_logs["reward"], chosen_mu_hat))
potential_rmse = np.sqrt(mean_squared_error(eval_logs[mu_cols].to_numpy().ravel(), mu_hat_eval.ravel()))

pd.DataFrame(
    {
        "diagnostic": [
            "Out-of-fold reward RMSE on observed actions",
            "Out-of-fold reward RMSE against all simulated potential means",
        ],
        "value": [reward_rmse_chosen, potential_rmse],
    }
)

	diagnostic	value
0	Out-of-fold reward RMSE on observed actions	7.897
1	Out-of-fold reward RMSE against all simulated ...	3.663

This result should affect the next modeling or decision step. Offline policy value is credible only when the log contains comparable actions in comparable contexts.

# Assemble the summary table used to interpret the causal comparison.
def evaluate_ope_estimators(frame, target_probs, mu_hat, logging_probs=None, clip=12.0):
    """Evaluate the ope estimators calculation.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `target_probs`: target-policy action probabilities for each row in the logged dataset.
    - `mu_hat`: estimated reward matrix used as the outcome model in doubly robust evaluation.
    - `logging_probs`: behavior-policy probabilities for the actions observed in the logged data.
    - `clip`: maximum importance weight used to stabilize high-variance off-policy estimates.
    Returns: policy or estimator performance summaries for the candidate decision rule."""
    if logging_probs is None:
        logging_probs = frame[logging_prob_cols].to_numpy()
    action = frame["action"].to_numpy()
    reward = frame["reward"].to_numpy()
    n = len(frame)

    p_chosen = logging_probs[np.arange(n), action]
    q_chosen = target_probs[np.arange(n), action]
    weights = q_chosen / np.clip(p_chosen, 1e-6, None)
    clipped_weights = np.minimum(weights, clip)

    dm_row = np.sum(target_probs * mu_hat, axis=1)
    residual = reward - mu_hat[np.arange(n), action]

    ips_contrib = weights * reward
    clipped_ips_contrib = clipped_weights * reward
    dr_contrib = dm_row + weights * residual
    clipped_dr_contrib = dm_row + clipped_weights * residual

    estimators = {
        "Direct method": dm_row,
        "IPS": ips_contrib,
        "Clipped IPS": clipped_ips_contrib,
        "Doubly robust": dr_contrib,
        "Clipped DR": clipped_dr_contrib,
    }

    rows = []
    for name, contrib in estimators.items():
        rows.append(
            {
                "estimator": name,
                "estimate": np.mean(contrib),
                "std_error": np.std(contrib, ddof=1) / np.sqrt(n),
            }
        )

    if weights.sum() > 0:
        snips_estimate = np.sum(weights * reward) / np.sum(weights)
        snips_if = weights * (reward - snips_estimate) / np.mean(weights)
        rows.append(
            {
                "estimator": "SNIPS",
                "estimate": snips_estimate,
                "std_error": np.std(snips_if, ddof=1) / np.sqrt(n),
            }
        )

    return pd.DataFrame(rows), weights


all_ope_rows = []
weight_rows = []
for policy_name, probs in target_policies.items():
    estimates, weights = evaluate_ope_estimators(eval_logs, probs, mu_hat_eval, clip=12.0)
    truth = true_policy_value(eval_logs, probs)
    estimates["policy"] = policy_name
    estimates["true_value"] = truth
    estimates["bias_vs_truth"] = estimates["estimate"] - truth
    estimates["ci_low"] = estimates["estimate"] - 1.96 * estimates["std_error"]
    estimates["ci_high"] = estimates["estimate"] + 1.96 * estimates["std_error"]
    all_ope_rows.append(estimates)

    weight_rows.append(
        {
            "policy": policy_name,
            "mean_weight": weights.mean(),
            "max_weight": weights.max(),
            "share_zero_weight": np.mean(weights == 0),
            "share_weight_gt_10": np.mean(weights > 10),
            "effective_sample_size": effective_sample_size(weights),
            "ess_fraction": effective_sample_size(weights) / len(weights),
        }
    )

ope_results = pd.concat(all_ope_rows, ignore_index=True)
weight_diagnostics = pd.DataFrame(weight_rows).sort_values("ess_fraction")

display(weight_diagnostics)

	policy	mean_weight	max_weight	share_zero_weight	share_weight_gt_10	effective_sample_size	ess_fraction
2	Business rule	1.048	13.851	0.677	0.001	1416.516	0.262
5	Aggressive call-heavy	1.013	10.549	0.623	0.000	1668.987	0.309
3	Model argmax	1.012	8.882	0.598	0.000	1939.358	0.359
4	Conservative model	1.009	7.882	0.000	0.000	2420.539	0.448
1	Uniform random	0.991	4.620	0.000	0.000	4238.381	0.785
0	Logging policy	1.000	1.000	0.000	0.000	5400.000	1.000

The effective sample size is a practical OPE warning light:

\[ ESS = \frac{(\sum_{i} w_i)^2}{\sum_{i} w_i^2} \]

A low ESS means the estimate is driven by a small number of heavily weighted observations. In that situation, the point estimate may be unstable even if the dataset is large.

display(
    ope_results[
        ["policy", "estimator", "estimate", "std_error", "ci_low", "ci_high", "true_value", "bias_vs_truth"]
    ].sort_values(["policy", "estimator"])
)

	policy	estimator	estimate	std_error	ci_low	ci_high	true_value	bias_vs_truth
34	Aggressive call-heavy	Clipped DR	65.115	0.327	64.475	65.756	65.291	-0.176
32	Aggressive call-heavy	Clipped IPS	65.728	1.283	63.213	68.242	65.291	0.436
30	Aggressive call-heavy	Direct method	65.303	0.259	64.796	65.810	65.291	0.012
33	Aggressive call-heavy	Doubly robust	65.115	0.327	64.475	65.756	65.291	-0.176
31	Aggressive call-heavy	IPS	65.728	1.283	63.213	68.242	65.291	0.436
35	Aggressive call-heavy	SNIPS	64.860	0.533	63.815	65.904	65.291	-0.432
16	Business rule	Clipped DR	61.910	0.314	61.296	62.525	61.850	0.060
14	Business rule	Clipped IPS	63.968	1.473	61.082	66.855	61.850	2.118
12	Business rule	Direct method	61.787	0.223	61.349	62.224	61.850	-0.064
15	Business rule	Doubly robust	61.912	0.314	61.297	62.526	61.850	0.061
13	Business rule	IPS	63.992	1.475	61.101	66.883	61.850	2.141
17	Business rule	SNIPS	61.081	0.478	60.145	62.018	61.850	-0.769
28	Conservative model	Clipped DR	66.319	0.272	65.786	66.853	66.276	0.043
26	Conservative model	Clipped IPS	67.100	1.056	65.030	69.170	66.276	0.823
24	Conservative model	Direct method	66.332	0.223	65.895	66.769	66.276	0.056
27	Conservative model	Doubly robust	66.319	0.272	65.786	66.853	66.276	0.043
25	Conservative model	IPS	67.100	1.056	65.030	69.170	66.276	0.823
29	Conservative model	SNIPS	66.505	0.357	65.806	67.205	66.276	0.229
4	Logging policy	Clipped DR	62.780	0.244	62.301	63.259	62.792	-0.012
2	Logging policy	Clipped IPS	62.813	0.254	62.315	63.311	62.792	0.021
0	Logging policy	Direct method	62.849	0.215	62.428	63.270	62.792	0.057
3	Logging policy	Doubly robust	62.780	0.244	62.301	63.259	62.792	-0.012
1	Logging policy	IPS	62.813	0.254	62.315	63.311	62.792	0.021
5	Logging policy	SNIPS	62.813	0.254	62.315	63.311	62.792	0.021
22	Model argmax	Clipped DR	67.384	0.286	66.825	67.944	67.325	0.059
20	Model argmax	Clipped IPS	68.387	1.260	65.917	70.857	67.325	1.062
18	Model argmax	Direct method	67.382	0.228	66.935	67.828	67.325	0.056
21	Model argmax	Doubly robust	67.384	0.286	66.825	67.944	67.325	0.059
19	Model argmax	IPS	68.387	1.260	65.917	70.857	67.325	1.062
23	Model argmax	SNIPS	67.566	0.400	66.782	68.349	67.325	0.240
10	Uniform random	Clipped DR	60.284	0.243	59.808	60.760	60.333	-0.049
8	Uniform random	Clipped IPS	59.805	0.459	58.904	60.705	60.333	-0.529
6	Uniform random	Direct method	60.384	0.204	59.984	60.783	60.333	0.050
9	Uniform random	Doubly robust	60.284	0.243	59.808	60.760	60.333	-0.049
7	Uniform random	IPS	59.805	0.459	58.904	60.705	60.333	-0.529
11	Uniform random	SNIPS	60.368	0.271	59.836	60.900	60.333	0.035

Here the method has to be judged against the design and the evidence. Logged data can evaluate a new policy only where the old policy created enough support.

# Build the visualization for the estimates or diagnostics computed above.
fig, axes = plt.subplots(1, 2, figsize=(15, 5.2))

selected_estimators = ["Direct method", "IPS", "SNIPS", "Doubly robust", "Clipped DR"]
plot_results = ope_results[ope_results["estimator"].isin(selected_estimators)].copy()

sns.scatterplot(
    data=plot_results,
    x="true_value",
    y="estimate",
    hue="estimator",
    style="policy",
    s=85,
    ax=axes[0],
)
min_value = min(plot_results["true_value"].min(), plot_results["estimate"].min()) - 2
max_value = max(plot_results["true_value"].max(), plot_results["estimate"].max()) + 2
axes[0].plot([min_value, max_value], [min_value, max_value], color="#111827", linestyle="--")
axes[0].set_title("OPE estimates versus simulation truth")
axes[0].set_xlabel("True policy value")
axes[0].set_ylabel("Estimated policy value")

bias_plot = plot_results.copy()
bias_plot["policy_estimator"] = bias_plot["policy"] + " | " + bias_plot["estimator"]
sns.barplot(
    data=bias_plot,
    y="policy",
    x="bias_vs_truth",
    hue="estimator",
    ax=axes[1],
)
axes[1].axvline(0, color="#111827", linewidth=1.1)
axes[1].set_title("Estimator bias in the simulation")
axes[1].set_xlabel("Estimate minus true value")
axes[1].set_ylabel("")

plt.tight_layout()
plt.show()

The comparison shows the central OPE tradeoff:

Direct method can have low variance but can be biased when the reward model is wrong.
IPS is less model-dependent but can have high variance when target and logging policies disagree.
SNIPS often stabilizes IPS but introduces small finite-sample bias.
DR often performs well because it combines a reward model with propensity correction.
Clipping reduces variance but adds bias.

No estimator dominates in every problem. The best estimator depends on overlap, reward-model quality, action-space size, and how far the target policy moves from the logging policy.

8. Inspecting Importance Weights

Importance weights are the mechanical heart of IPS and DR:

\[ w_i = \frac{\pi(A_i \mid X_i)}{\mu(A_i \mid X_i)} \]

Large weights mean the target policy puts much more probability on the logged action than the logging policy did. Zero weights mean the logged action would not be chosen by the target policy.

Weight diagnostics are not optional. They tell us whether the logged data are informative for the proposed target policy.

# Fit the models for this section and assemble the estimates used in the discussion.
weight_plot_rows = []
for policy_name, probs in target_policies.items():
    _, weights = evaluate_ope_estimators(eval_logs, probs, mu_hat_eval, clip=12.0)
    if policy_name in ["Business rule", "Model argmax", "Conservative model", "Aggressive call-heavy"]:
        sample_weights = pd.DataFrame({"policy": policy_name, "weight": weights})
        weight_plot_rows.append(sample_weights)

weight_plot = pd.concat(weight_plot_rows, ignore_index=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 4.8))

sns.histplot(
    data=weight_plot,
    x="weight",
    hue="policy",
    bins=45,
    common_norm=False,
    stat="density",
    alpha=0.38,
    ax=axes[0],
)
axes[0].set_xlim(0, 25)
axes[0].set_title("Importance-weight distributions")
axes[0].set_xlabel("Importance weight")

sns.barplot(
    data=weight_diagnostics,
    y="policy",
    x="ess_fraction",
    color="#2563eb",
    ax=axes[1],
)
axes[1].set_title("Effective sample size fraction")
axes[1].set_xlabel("ESS / n")
axes[1].set_ylabel("")

plt.tight_layout()
plt.show()

The aggressive policy is the kind of policy that often makes OPE uncomfortable. It may be operationally attractive, but if historical logs rarely used the same actions in similar contexts, then the estimate depends on a thin slice of the data.

An OPE readout should not simply rank policies by point estimate. It should also report whether each estimate is supported by the logs.

9. Clipping Sensitivity

Clipping caps the largest importance weights:

\[ w_i^{clip} = \min(w_i, c) \]

This reduces variance but changes the estimand. A large shift in the estimate across clipping thresholds is a warning that the evaluation is weight-sensitive.

# Fit the models for this section and assemble the estimates used in the discussion.
clip_rows = []
clip_grid = [2, 4, 6, 8, 12, 20, 40]
for policy_name in ["Model argmax", "Conservative model", "Aggressive call-heavy"]:
    probs = target_policies[policy_name]
    truth = true_policy_value(eval_logs, probs)
    for clip_value in clip_grid:
        estimates, _ = evaluate_ope_estimators(eval_logs, probs, mu_hat_eval, clip=clip_value)
        subset = estimates[estimates["estimator"].isin(["Clipped IPS", "Clipped DR"])].copy()
        subset["clip"] = clip_value
        subset["policy"] = policy_name
        subset["true_value"] = truth
        clip_rows.append(subset)

clip_df = pd.concat(clip_rows, ignore_index=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 4.8))
for ax, estimator in zip(axes, ["Clipped IPS", "Clipped DR"]):
    sns.lineplot(
        data=clip_df[clip_df["estimator"] == estimator],
        x="clip",
        y="estimate",
        hue="policy",
        marker="o",
        ax=ax,
    )
    for policy_name in ["Model argmax", "Conservative model", "Aggressive call-heavy"]:
        truth = true_policy_value(eval_logs, target_policies[policy_name])
        ax.axhline(truth, color="#64748b", linestyle=":", linewidth=1)
    ax.set_title(f"{estimator} sensitivity to clipping")
    ax.set_xlabel("Weight cap")
    ax.set_ylabel("Estimated policy value")

plt.tight_layout()
plt.show()

Stable estimates across a reasonable clipping range are reassuring. Large movement means the offline estimate should be treated as exploratory before any launch decision.

logged data can evaluate a new policy only where the behavior policy created enough support for the actions being considered.

10. Known Versus Estimated Logging Propensities

Production OPE works best when the logging system records the exact action probabilities used at decision time.

If propensities are not logged, analysts may estimate them from historical data. That can be useful and adds another nuisance model with another source of bias. Worse, if the old policy was deterministic, the missing support cannot be repaired by estimating propensities after the fact.

Here we estimate the logging policy from the evaluation logs and compare OPE for one target policy using known versus estimated propensities.

# Fit the models for this section and assemble the estimates used in the discussion.
def crossfit_logging_propensities(frame, feature_cols, n_splits=5, seed=5713):
    """Build cross-fitted logging propensities quantities.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `feature_cols`: list of pre-treatment covariate column names used as model features or effect modifiers.
    - `n_splits`: number of folds used for cross-fitting so rows are predicted out of sample.
    - `seed`: integer random seed that makes the simulation or model split reproducible.
    Returns: out-of-fold predictions or scores built without using each row to predict itself."""
    n = len(frame)
    prop_hat = np.zeros((n, N_ACTIONS))
    base_model = HistGradientBoostingClassifier(
        max_iter=180,
        learning_rate=0.045,
        min_samples_leaf=45,
        l2_regularization=0.03,
        random_state=seed,
    )
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(frame):
        train = frame.iloc[train_idx]
        test = frame.iloc[test_idx]
        model = clone(base_model)
        model.fit(train[feature_cols], train["action"])
        pred = model.predict_proba(test[feature_cols])
        full_pred = np.zeros((len(test), N_ACTIONS))
        for col_idx, action_id in enumerate(model.classes_):
            full_pred[:, int(action_id)] = pred[:, col_idx]
        prop_hat[test_idx, :] = np.clip(full_pred, 0.02, 0.98)
        prop_hat[test_idx, :] = prop_hat[test_idx, :] / prop_hat[test_idx, :].sum(axis=1, keepdims=True)
    return prop_hat


estimated_logging_probs = crossfit_logging_propensities(eval_logs, feature_cols)
known_logging_probs = eval_logs[logging_prob_cols].to_numpy()

propensity_compare = pd.DataFrame(
    {
        "diagnostic": [
            "Mean absolute propensity error",
            "Log loss using known logging probabilities",
            "Log loss using estimated logging probabilities",
        ],
        "value": [
            np.mean(np.abs(estimated_logging_probs - known_logging_probs)),
            log_loss(eval_logs["action"], known_logging_probs, labels=list(range(N_ACTIONS))),
            log_loss(eval_logs["action"], estimated_logging_probs, labels=list(range(N_ACTIONS))),
        ],
    }
)
display(propensity_compare)

	diagnostic	value
0	Mean absolute propensity error	0.091
1	Log loss using known logging probabilities	1.258
2	Log loss using estimated logging probabilities	1.384

The point is to connect the number to the decision it supports. The behavior policy is part of the evidence, so weights and support deserve as much attention as the value estimate.

# Fit the models for this section and assemble the estimates used in the discussion.
comparison_policy = "Conservative model"
known_estimates, known_weights = evaluate_ope_estimators(
    eval_logs,
    target_policies[comparison_policy],
    mu_hat_eval,
    logging_probs=known_logging_probs,
    clip=12.0,
)
estimated_estimates, estimated_weights = evaluate_ope_estimators(
    eval_logs,
    target_policies[comparison_policy],
    mu_hat_eval,
    logging_probs=estimated_logging_probs,
    clip=12.0,
)

known_estimates["propensity_source"] = "Known logged propensity"
estimated_estimates["propensity_source"] = "Estimated propensity model"
propensity_source_comparison = pd.concat([known_estimates, estimated_estimates], ignore_index=True)
propensity_source_comparison["true_value"] = true_policy_value(eval_logs, target_policies[comparison_policy])
propensity_source_comparison["bias_vs_truth"] = (
    propensity_source_comparison["estimate"] - propensity_source_comparison["true_value"]
)

display(
    propensity_source_comparison[
        ["propensity_source", "estimator", "estimate", "std_error", "true_value", "bias_vs_truth"]
    ]
)

	propensity_source	estimator	estimate	std_error	true_value	bias_vs_truth
0	Known logged propensity	Direct method	66.332	0.223	66.276	0.056
1	Known logged propensity	IPS	67.100	1.056	66.276	0.823
2	Known logged propensity	Clipped IPS	67.100	1.056	66.276	0.823
3	Known logged propensity	Doubly robust	66.319	0.272	66.276	0.043
4	Known logged propensity	Clipped DR	66.319	0.272	66.276	0.043
5	Known logged propensity	SNIPS	66.505	0.357	66.276	0.229
6	Estimated propensity model	Direct method	66.332	0.223	66.276	0.056
7	Estimated propensity model	IPS	80.279	1.547	66.276	14.003
8	Estimated propensity model	Clipped IPS	79.725	1.486	66.276	13.449
9	Estimated propensity model	Doubly robust	66.228	0.321	66.276	-0.048
10	Estimated propensity model	Clipped DR	66.187	0.310	66.276	-0.089
11	Estimated propensity model	SNIPS	65.662	0.414	66.276	-0.615

The estimated propensity model is usable in this simulation. It is still inferior to having the true logged propensities. In real systems, logging propensities is a design requirement, not a reporting nicety.

11. Policy Contrast: Is the New Policy Better Than the Current One?

Often the real business question moves from absolute policy value to pairwise comparison.

Is the proposed policy better than the current policy?

For a target policy \(\pi\) and current policy \(\mu\), estimate:

\[ \Delta = V(\pi)-V(\mu) \]

The direct way is to estimate both values and subtract them. The uncertainty can be large because the two estimates are correlated. For a practical first pass, we can compute per-row DR contributions for both policies and take their difference.

# Fit the models for this section and assemble the estimates used in the discussion.
def dr_contribution(frame, target_probs, mu_hat, logging_probs=None):
    """Compute row-level doubly robust policy-value contributions.
    
    Inputs:
    - `frame`: pandas DataFrame containing the rows for this function and the causal variables named by the other arguments.
    - `target_probs`: target-policy action probabilities for each row in the logged dataset.
    - `mu_hat`: estimated reward matrix used as the outcome model in doubly robust evaluation.
    - `logging_probs`: behavior-policy probabilities for the actions observed in the logged data.
    Returns: a NumPy array of doubly robust scores for the target policy."""
    if logging_probs is None:
        logging_probs = frame[logging_prob_cols].to_numpy()
    action = frame["action"].to_numpy()
    reward = frame["reward"].to_numpy()
    n = len(frame)
    p_chosen = logging_probs[np.arange(n), action]
    q_chosen = target_probs[np.arange(n), action]
    weights = q_chosen / np.clip(p_chosen, 1e-6, None)
    dm_row = np.sum(target_probs * mu_hat, axis=1)
    residual = reward - mu_hat[np.arange(n), action]
    return dm_row + weights * residual


contrast_rows = []
logging_contrib = dr_contribution(eval_logs, target_policies["Logging policy"], mu_hat_eval)
for policy_name in ["Business rule", "Model argmax", "Conservative model", "Aggressive call-heavy"]:
    target_contrib = dr_contribution(eval_logs, target_policies[policy_name], mu_hat_eval)
    contrast = target_contrib - logging_contrib
    contrast_rows.append(
        {
            "policy": policy_name,
            "dr_gain_vs_logging": contrast.mean(),
            "std_error": contrast.std(ddof=1) / np.sqrt(len(contrast)),
            "ci_low": contrast.mean() - 1.96 * contrast.std(ddof=1) / np.sqrt(len(contrast)),
            "ci_high": contrast.mean() + 1.96 * contrast.std(ddof=1) / np.sqrt(len(contrast)),
            "true_gain_vs_logging": true_policy_value(eval_logs, target_policies[policy_name])
            - true_policy_value(eval_logs, target_policies["Logging policy"]),
        }
    )

contrast_table = pd.DataFrame(contrast_rows).sort_values("dr_gain_vs_logging", ascending=False)
display(contrast_table)

	policy	dr_gain_vs_logging	std_error	ci_low	ci_high	true_gain_vs_logging
1	Model argmax	4.605	0.144	4.322	4.887	4.533
2	Conservative model	3.539	0.119	3.306	3.772	3.484
3	Aggressive call-heavy	2.335	0.176	1.991	2.680	2.499
0	Business rule	-0.868	0.206	-1.272	-0.465	-0.942

Here the example becomes operational. Offline policy value is credible only when the log contains comparable actions in comparable contexts.

fig, ax = plt.subplots(figsize=(9, 4.8))
plot_contrast = contrast_table.sort_values("dr_gain_vs_logging")
ax.barh(plot_contrast["policy"], plot_contrast["dr_gain_vs_logging"], color="#2563eb", alpha=0.82)
ax.errorbar(
    x=plot_contrast["dr_gain_vs_logging"],
    y=np.arange(len(plot_contrast)),
    xerr=[
        plot_contrast["dr_gain_vs_logging"] - plot_contrast["ci_low"],
        plot_contrast["ci_high"] - plot_contrast["dr_gain_vs_logging"],
    ],
    fmt="none",
    ecolor="#111827",
    capsize=3,
)
ax.scatter(plot_contrast["true_gain_vs_logging"], np.arange(len(plot_contrast)), color="#dc2626", label="Simulation truth", zorder=5)
ax.axvline(0, color="#111827", linewidth=1.1)
ax.set_title("DR-estimated gain versus logging policy")
ax.set_xlabel("Policy value gain")
ax.legend()
plt.tight_layout()
plt.show()

This is the kind of table that belongs in an executive readout. It compares proposed policies to the current baseline, shows uncertainty, and keeps the support diagnostics nearby.

A high point estimate should still pass a support check before deployment. A policy with a slightly lower estimate but much better support may be the better next experiment.

12. Stress Test: What If the Target Policy Is Too Far Away?

OPE is strongest when the target policy is close enough to the logging policy that the logs contain relevant evidence.

If a proposed policy takes actions that the logging policy almost never took for similar contexts, then:

IPS weights become extreme.
SNIPS may look stable while estimating a support-limited quantity.
Direct method relies heavily on extrapolation.
DR inherits both reward-model risk and weight instability.

In practice, the right next step may be a small randomized exploration pilot, not a full launch.

support_summary = weight_diagnostics.merge(
    truth_table[["policy", "true_gain_vs_logging"]],
    on="policy",
    how="left",
)
support_summary["ope_risk_level"] = pd.cut(
    support_summary["ess_fraction"],
    bins=[-0.01, 0.10, 0.30, 0.60, 1.01],
    labels=["High", "Elevated", "Moderate", "Low"],
)

display(
    support_summary[
        [
            "policy",
            "true_gain_vs_logging",
            "ess_fraction",
            "max_weight",
            "share_weight_gt_10",
            "share_zero_weight",
            "ope_risk_level",
        ]
    ].sort_values("ess_fraction")
)

	policy	true_gain_vs_logging	ess_fraction	max_weight	share_weight_gt_10	share_zero_weight	ope_risk_level
0	Business rule	-0.942	0.262	13.851	0.001	0.677	Elevated
1	Aggressive call-heavy	2.499	0.309	10.549	0.000	0.623	Moderate
2	Model argmax	4.533	0.359	8.882	0.000	0.598	Moderate
3	Conservative model	3.484	0.448	7.882	0.000	0.000	Moderate
4	Uniform random	-2.459	0.785	4.620	0.000	0.000	Low
5	Logging policy	0.000	1.000	1.000	0.000	0.000	Low

13. Industry Readout Template

A professional OPE readout should be explicit about both value and reliability:

Decision: which target policy is being considered?
Logged data: time window, eligible population, action set, reward definition, and logging policy.
Identification: why historical action assignment can be treated as randomized conditional on context.
Propensities: whether exact propensities were logged or estimated after the fact.
Estimators: direct method, IPS/SNIPS, DR, and clipped variants.
Main result: estimated value and gain versus current policy.
Support diagnostics: ESS, max weights, zero-weight share, and action mix shift.
Sensitivity: clipping thresholds, reward-model choices, and target-policy stochasticity.
Recommendation: launch, reject, revise policy, or run an exploration pilot.

Good OPE is not just an estimator. It is a decision-control process for changing deployed systems safely.

# Assemble the summary table used to interpret the causal comparison.
recommended_policy = contrast_table.iloc[0]["policy"]
recommended_support = support_summary[support_summary["policy"] == recommended_policy].iloc[0]
recommended_contrast = contrast_table[contrast_table["policy"] == recommended_policy].iloc[0]

readout = pd.DataFrame(
    {
        "item": [
            "Top DR-estimated policy",
            "Estimated gain vs logging",
            "Approximate 95% CI",
            "Simulation true gain vs logging",
            "ESS fraction",
            "Maximum importance weight",
            "OPE risk level",
            "Recommended next step",
        ],
        "value": [
            recommended_policy,
            f"{recommended_contrast['dr_gain_vs_logging']:.2f}",
            f"[{recommended_contrast['ci_low']:.2f}, {recommended_contrast['ci_high']:.2f}]",
            f"{recommended_contrast['true_gain_vs_logging']:.2f}",
            f"{recommended_support['ess_fraction']:.2%}",
            f"{recommended_support['max_weight']:.1f}",
            str(recommended_support["ope_risk_level"]),
            "Use OPE as a screening result, then validate with a randomized pilot.",
        ],
    }
)
display(readout)

	item	value
0	Top DR-estimated policy	Model argmax
1	Estimated gain vs logging	4.60
2	Approximate 95% CI	[4.32, 4.89]
3	Simulation true gain vs logging	4.53
4	ESS fraction	35.91%
5	Maximum importance weight	8.9
6	OPE risk level	Moderate
7	Recommended next step	Use OPE as a screening result, then validate w...

Key Takeaways

OPE estimates the value of a target policy using data logged under another policy.
Partial feedback is the central problem: only the chosen action’s reward is observed.
IPS corrects the action-distribution mismatch with importance weights.
SNIPS often reduces variance but can introduce finite-sample bias.
Direct method depends heavily on reward-model quality.
Doubly robust estimation combines reward modeling with importance-weighted residual correction.
Weight diagnostics are central. Low ESS and large weights mean the logged data provide weak support.
Exact logged propensities are a production requirement for reliable OPE.
Clipping, shrinkage, and stochastic target policies can improve stability, but they change the bias-variance tradeoff.
OPE should usually screen policies for experimentation, not replace final validation for high-stakes decisions.

References

Dudík, M., Erhan, D., Langford, J., & Li, L. (2014). Doubly robust policy evaluation and optimization. Statistical Science, 29(4), 485-511. https://doi.org/10.1214/14-sts500

Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663-685. https://doi.org/10.1080/01621459.1952.10483446

Saito, Y., & Joachims, T. (2022). Off-policy evaluation for large action spaces via embeddings. arXiv. https://doi.org/10.48550/arxiv.2202.06317

Su, Y., Dimakopoulou, M., Krishnamurthy, A., & Dudík, M. (2019). Doubly robust off-policy evaluation with shrinkage. arXiv. https://doi.org/10.48550/arxiv.1907.09623

Zhan, R., Hadad, V., Hirshberg, D. A., & Athey, S. (2021). Off-policy evaluation via adaptive weighting with data from contextual bandits. arXiv. https://doi.org/10.48550/arxiv.2106.02029