Off-Policy Evaluation for Bandit Systems

An applied causal inference lab on evaluating bandit policies from logged feedback.
Figure 1: Policy lift compared with support diagnostics, showing why value estimates need reliability checks before deployment.

This lab studies how to evaluate a new decision policy using historical bandit logs. The behavior policy that produced the logs chose actions, observed rewards, and stored either propensities or the features an analyst needs to reconstruct how likely each action was under that policy.

The main lesson is that off-policy evaluation (OPE) is not just a value-estimation formula. It is a decision workflow. A policy with high estimated lift can still be risky when it relies on weak support, unstable weights, or narrow slices of the logged action space. The lab therefore pairs inverse propensity scoring (IPS), self-normalized IPS (SNIPS), and doubly robust estimators with diagnostics, uncertainty summaries, and contextual policy-learning comparisons.

Lab Sequence

01. Open Bandit EDA

Introduces the logged bandit dataset, available contexts, actions, rewards, and the policy-evaluation question that motivates the rest of the lab.

02. Behavior Policy and Propensities

Reconstructs the behavior-policy information needed for OPE and studies whether candidate policies are sufficiently supported by historical logging.
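
One way to make "sufficiently supported" concrete is to inspect the importance weights a candidate policy would induce on the logs. The sketch below is illustrative only: the function names, the `min_propensity` threshold of 0.05, and the toy probabilities are assumptions, not values taken from the lab. It reports the largest weight, the Kish effective sample size, and the share of logged rounds whose behavior propensity falls under the threshold.

```python
def importance_weights(target_probs, behavior_probs):
    """w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i) for each logged round."""
    weights = []
    for pe, pb in zip(target_probs, behavior_probs):
        if pb <= 0.0:
            raise ValueError("logged action has zero behavior propensity")
        weights.append(pe / pb)
    return weights


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0


def support_report(target_probs, behavior_probs, min_propensity=0.05):
    """Summarize how well the logs support a candidate policy.

    min_propensity is a hypothetical threshold chosen for illustration.
    """
    weights = importance_weights(target_probs, behavior_probs)
    n = len(weights)
    low = sum(1 for pb in behavior_probs if pb < min_propensity)
    return {
        "n": n,
        "mean_weight": sum(weights) / n,
        "max_weight": max(weights),
        "ess": effective_sample_size(weights),
        "frac_low_propensity": low / n,
    }


# Toy data: probability of the logged action under each policy.
behavior = [0.5, 0.5, 0.25, 0.25, 0.04]
target = [0.5, 0.25, 0.5, 0.25, 0.4]
report = support_report(target, behavior)
for name, value in report.items():
    print(name, value)
```

In this toy log a single round with propensity 0.04 carries a weight near 10, which pulls the effective sample size to roughly 2 out of 5 rounds. That pattern is the kind of weak support the diagnostics are meant to surface before any value estimate is trusted.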

03. IPS and SNIPS

Builds inverse-propensity and self-normalized estimators, then examines how importance weights make off-policy learning possible and, at the same time, fragile.
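
A minimal sketch of the two estimators, assuming each logged round supplies the reward, the probability the candidate policy assigns to the logged action, and the behavior propensity of that action. The function names and the toy numbers are illustrative and not drawn from the lab notebooks.

```python
def ips_estimate(rewards, target_probs, behavior_probs):
    """IPS: average of w_i * r_i, with w_i = pi_e / pi_b."""
    n = len(rewards)
    total = 0.0
    for r, pe, pb in zip(rewards, target_probs, behavior_probs):
        total += (pe / pb) * r
    return total / n


def snips_estimate(rewards, target_probs, behavior_probs):
    """SNIPS: sum(w_i * r_i) / sum(w_i), normalizing by the realized weights."""
    weighted_reward = 0.0
    weight_sum = 0.0
    for r, pe, pb in zip(rewards, target_probs, behavior_probs):
        w = pe / pb
        weighted_reward += w * r
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("candidate policy has no overlap with the logged actions")
    return weighted_reward / weight_sum


# Toy log of four rounds with binary rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
behavior = [0.5, 0.5, 0.25, 0.25]
target = [0.5, 0.25, 0.5, 0.25]

ips_value = ips_estimate(rewards, target, behavior)
snips_value = snips_estimate(rewards, target, behavior)
print("IPS:", ips_value)
print("SNIPS:", snips_value)
```

IPS divides by the number of rounds, so it is unbiased when the propensities are correct but can swing widely when a few weights are large. SNIPS divides by the sum of the weights, which keeps the estimate inside the range of observed rewards at the cost of a small bias.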

04. Doubly Robust OPE

Combines reward modeling with propensity weighting to reduce variance and clarify the role of model fit in decision-quality OPE.
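
The combination can be sketched as a reward-model baseline plus an importance-weighted correction on the logged action. The example assumes a small discrete action set and a per-round dictionary layout (`target_probs`, `q_hat`, `action`, `reward`, `propensity`) that is hypothetical and chosen only for illustration. The `q_hat` values stand in for predictions from whatever reward model the analyst fits.

```python
def doubly_robust_estimate(logs):
    """DR value: mean of sum_a pi_e(a|x) * q_hat(x, a) + w * (r - q_hat(x, a_logged))."""
    total = 0.0
    for row in logs:
        pi_e = row["target_probs"]   # candidate policy's distribution over actions
        q_hat = row["q_hat"]         # reward-model prediction for each action
        a = row["action"]            # index of the logged action
        # Model-based value of the candidate policy in this context.
        baseline = sum(p * q for p, q in zip(pi_e, q_hat))
        # Importance-weighted residual corrects the model on the logged action.
        weight = pi_e[a] / row["propensity"]
        total += baseline + weight * (row["reward"] - q_hat[a])
    return total / len(logs)


logs = [
    {"action": 0, "reward": 1.0, "propensity": 0.5,
     "target_probs": [0.5, 0.5], "q_hat": [0.5, 0.5]},
    {"action": 1, "reward": 0.0, "propensity": 0.5,
     "target_probs": [0.25, 0.75], "q_hat": [0.5, 0.25]},
]
dr_value = doubly_robust_estimate(logs)

# With an all-zero reward model the baseline vanishes and DR reduces to IPS.
zero_model_logs = [dict(row, q_hat=[0.0, 0.0]) for row in logs]
ips_value = doubly_robust_estimate(zero_model_logs)
print("DR:", dr_value)
print("IPS (zero reward model):", ips_value)
```

When the reward model fits well, the residuals are small, so large weights multiply small numbers and the variance drops. When the model fits poorly, the weighted correction does most of the work and the estimate behaves more like IPS.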

05. Policy Comparison and Sensitivity

Compares candidate policies using lift, support, clipping behavior, and uncertainty so the decision is based on robustness rather than a single point estimate.
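
A sensitivity check of this kind can be sketched by recomputing a clipped IPS estimate across several clipping thresholds and attaching a percentile bootstrap interval to each. The thresholds, the number of bootstrap resamples, the seed, and the toy data below are all assumptions made for illustration.

```python
import math
import random


def clipped_ips(rewards, weights, clip):
    """IPS with each importance weight capped at `clip`."""
    n = len(rewards)
    return sum(min(w, clip) * r for r, w in zip(rewards, weights)) / n


def bootstrap_ci(rewards, weights, clip, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the clipped IPS estimate."""
    rng = random.Random(seed)
    n = len(rewards)
    estimates = []
    for _ in range(n_boot):
        # Resample logged rounds with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            clipped_ips([rewards[i] for i in idx], [weights[i] for i in idx], clip)
        )
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
weights = [1.0, 0.5, 2.0, 1.0, 10.0]

results = {}
for clip in (2.0, 5.0, math.inf):  # math.inf means no clipping
    point = clipped_ips(rewards, weights, clip)
    results[clip] = (point, bootstrap_ci(rewards, weights, clip))
    print("clip", clip, "estimate", point, "interval", results[clip][1])
```

Here the point estimate moves from 1.0 at a clip of 2 to 2.6 with no clipping, driven entirely by one round with weight 10. An estimate that changes this much with the clipping threshold is a signal to weigh support and uncertainty before ranking policies by lift alone.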

06. Contextual Policy Learning

Learns context-dependent policies from logged feedback and evaluates whether personalization improves value without weakening support or increasing decision risk.
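
As a rough illustration of the idea, the sketch below learns a tabular greedy policy from training logs and scores it with IPS on held-out logs. It assumes a small set of discrete contexts, which is a simplification rather than the lab's method. The tuple layout `(context, action, reward, propensity)`, the `min_count` support guard, and the fallback `default_action` are hypothetical.

```python
from collections import defaultdict


def learn_greedy_policy(logs, min_count=2, default_action=0):
    """Pick, per context, the action with the best self-normalized weighted reward.

    Context-action pairs seen fewer than min_count times are ignored so the
    learned policy does not chase thinly supported actions.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    count = defaultdict(int)
    for ctx, action, reward, prop in logs:
        key = (ctx, action)
        num[key] += reward / prop
        den[key] += 1.0 / prop
        count[key] += 1

    best = {}
    for (ctx, action), seen in count.items():
        if seen < min_count:
            continue
        value = num[(ctx, action)] / den[(ctx, action)]
        if ctx not in best or value > best[ctx][1]:
            best[ctx] = (action, value)

    table = {ctx: choice[0] for ctx, choice in best.items()}
    return lambda ctx: table.get(ctx, default_action)


def ips_value(policy, logs):
    """IPS value of a deterministic policy: weight is 1/propensity on matches, else 0."""
    total = 0.0
    for ctx, action, reward, prop in logs:
        if policy(ctx) == action:
            total += reward / prop
    return total / len(logs)


train = [
    ("new", 0, 1.0, 0.5), ("new", 0, 1.0, 0.5),
    ("new", 1, 0.0, 0.5), ("new", 1, 0.0, 0.5),
    ("returning", 0, 0.0, 0.5), ("returning", 0, 0.0, 0.5),
    ("returning", 1, 1.0, 0.5), ("returning", 1, 1.0, 0.5),
]
holdout = [
    ("new", 0, 1.0, 0.5), ("new", 1, 0.0, 0.5),
    ("returning", 1, 1.0, 0.5), ("returning", 0, 0.0, 0.5),
]

policy = learn_greedy_policy(train)
personalized_value = ips_value(policy, holdout)
constant_value = ips_value(lambda ctx: 0, holdout)
print("personalized:", personalized_value)
print("always action 0:", constant_value)
```

Evaluating on logs that were not used for learning keeps the comparison honest, and the same support diagnostics used for any candidate policy still apply to the learned one.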