08. Causal Inference for Reinforcement Learning

Reinforcement Learning · Off-Policy Evaluation · Causal Inference · Lecture Notes
Sequential decisions, logged policies, bandits, off-policy evaluation, offline RL, RLHF, LLM agents, monitoring, and policy improvement.
Published: May 3, 2026

This course is written for learners who know the earlier causal tracks but may be new to reinforcement learning. It builds RL concepts from the causal viewpoint and then studies logged decisions, policy evaluation, offline RL, RLHF, and agentic systems.

Notebook links open rendered HTML pages generated from the source notebooks under notebooks/lectures/. Code is visible by default; rendering is configured not to execute live notebook code, so local LLM or GPU-heavy cells are not triggered during website builds.
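For readers curious how that works in practice, here is a minimal sketch of rendering a notebook to HTML without executing it, using nbconvert's Python API. The actual build tool behind the site is an assumption, and the filename below is hypothetical.

```python
# Sketch: convert a stored notebook to HTML without running its cells.
from pathlib import Path

from nbconvert import HTMLExporter

# HTMLExporter renders the cells and whatever outputs were saved in the
# .ipynb file; it does not execute code unless an ExecutePreprocessor is
# attached, so LLM calls and GPU-heavy cells never fire during the build.
exporter = HTMLExporter()

src = Path("notebooks/lectures/01-rl-as-causal-system.ipynb")  # hypothetical name
body, _resources = exporter.from_filename(str(src))
Path(src.with_suffix(".html").name).write_text(body, encoding="utf-8")
```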

Notebook Sequence

  01. RL as a Sequential Causal System
  02. States, Actions, Rewards, and Policies
  03. Potential Outcomes for Sequential Decisions
  04. Bandits, MDPs, and Causal Estimands
  05. Logged Decisions, Propensities, and Support
  06. Contextual Bandits and Randomized Exploration
  07. Off-Policy Evaluation with Inverse Propensity Weighting
  08. Variance, Clipping, and Self-Normalized OPE
  09. Doubly Robust Off-Policy Evaluation
  10. Model-Based OPE and Simulators
  11. Fitted Q Evaluation and Value Functions
  12. Policy Learning from Observational and Logged Data
  13. Dynamic Treatment Regimes and the G-Formula
  14. Time-Varying Confounding and Marginal Structural Models
  15. Causal Graphs for Sequential Decision Systems
  16. Confounded Bandits and Observational RL
  17. Instruments, Encouragement, and Policy Variation
  18. Offline RL Dataset Quality and Extrapolation Risk
  19. Reward Design, Proxy Outcomes, and Goodhart Effects
  20. Safe Policy Improvement and Guardrails
  21. Heterogeneous Policy Effects and Personalization
  22. Interference, Multi-Agent Systems, and Spillovers
  23. RLHF, Preference Optimization, and Causal Questions
  24. RL for LLM Agents, Tool Use, and Sequential Tasks
  25. Online Experiments, Monitoring, and Policy Decay
  26. Capstone: Evaluating and Improving a Logged Policy

How To Read This Track

  • Work through the notebooks in order if you want the full course arc.
  • Treat each notebook as a lecture plus lab: read the discussion, inspect the code, and rerun locally when you want to experiment.
  • For AI-heavy notebooks, expect some brittleness when live model calls are enabled; that instability is part of the course material rather than something hidden from the reader.

The .ipynb sources remain in the matching folder under notebooks/lectures/.
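If you want to rerun a lecture locally, as suggested above, one option is nbclient, which executes a stored notebook in place. This is a minimal sketch under the assumption that Jupyter and nbclient are installed; the filename is hypothetical.

```python
# Sketch: execute one lecture notebook locally and save its outputs.
import nbformat
from nbclient import NotebookClient

path = "notebooks/lectures/07-ope-ipw.ipynb"  # hypothetical filename
nb = nbformat.read(path, as_version=4)

# NotebookClient runs every cell in order, which is what the website
# build deliberately skips; expect LLM or GPU cells to need local setup.
NotebookClient(nb, timeout=600).execute()

nbformat.write(nb, path)  # persist refreshed outputs back to the .ipynb
```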