08. Causal Inference for Reinforcement Learning
Reinforcement Learning · Off-Policy Evaluation · Causal Inference · Lecture Notes
Sequential decisions, logged policies, bandits, off-policy evaluation, offline RL, RLHF, LLM agents, monitoring, and policy improvement.
This course is written for learners who have completed the earlier causal tracks but may be new to reinforcement learning. It builds up RL concepts from a causal viewpoint and then studies logged decisions, policy evaluation, offline RL, RLHF, and agentic systems.
Notebook links open rendered HTML pages generated from the source notebooks under notebooks/lectures/. Code is visible by default; rendering is configured not to execute notebook code, so cells that call local LLMs or need a GPU are not triggered during website builds.
Notebook Sequence
- 01. RL as a Sequential Causal System
- 02. States, Actions, Rewards, and Policies
- 03. Potential Outcomes for Sequential Decisions
- 04. Bandits, MDPs, and Causal Estimands
- 05. Logged Decisions, Propensities, and Support
- 06. Contextual Bandits and Randomized Exploration
- 07. Off-Policy Evaluation with Inverse Propensity Weighting
- 08. Variance, Clipping, and Self-Normalized OPE (a minimal sketch of the estimators from 07 and 08 follows this list)
- 09. Doubly Robust Off-Policy Evaluation
- 10. Model-Based OPE and Simulators
- 11. Fitted Q Evaluation and Value Functions
- 12. Policy Learning from Observational and Logged Data
- 13. Dynamic Treatment Regimes and the G-Formula
- 14. Time-Varying Confounding and Marginal Structural Models
- 15. Causal Graphs for Sequential Decision Systems
- 16. Confounded Bandits and Observational RL
- 17. Instruments, Encouragement, and Policy Variation
- 18. Offline RL Dataset Quality and Extrapolation Risk
- 19. Reward Design, Proxy Outcomes, and Goodhart Effects
- 20. Safe Policy Improvement and Guardrails
- 21. Heterogeneous Policy Effects and Personalization
- 22. Interference, Multi-Agent Systems, and Spillovers
- 23. RLHF, Preference Optimization, and Causal Questions
- 24. RL for LLM Agents, Tool Use, and Sequential Tasks
- 25. Online Experiments, Monitoring, and Policy Decay
- 26. Capstone: Evaluating and Improving a Logged Policy
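As a concrete taste of what notebooks 07 and 08 cover, here is a minimal sketch of inverse propensity scoring (IPS) together with its clipped and self-normalized variants, run on synthetic bandit data. The function name, the toy logging setup, and the numbers are illustrative assumptions, not code taken from the course notebooks.

```python
import numpy as np

def ips_estimates(rewards, logged_probs, target_probs, clip=10.0):
    """Estimate the value of a target policy from logged bandit data.

    rewards      : observed rewards for the logged actions
    logged_probs : propensities pi_0(a_i | x_i) under the logging policy
    target_probs : probabilities pi(a_i | x_i) under the policy we evaluate
    clip         : cap on the importance weights (a simple variance control)
    """
    w = target_probs / logged_probs                    # importance weights
    ips = np.mean(w * rewards)                         # plain IPS
    clipped = np.mean(np.minimum(w, clip) * rewards)   # clipped IPS
    snips = np.sum(w * rewards) / np.sum(w)            # self-normalized IPS
    return ips, clipped, snips

# Tiny synthetic check (illustrative setup): the logging policy is uniform
# over 2 actions; the target policy always picks action 1, whose true mean
# reward is 0.8, so every estimate should land near 0.8.
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)
rewards = rng.binomial(1, np.where(actions == 1, 0.8, 0.2)).astype(float)
logged_probs = np.full(n, 0.5)
target_probs = (actions == 1).astype(float)  # deterministic target policy
print(ips_estimates(rewards, logged_probs, target_probs))
```

With uniform logging, the weights here never exceed 2, so clipping is inert; on real logged data with small propensities, the gap between the plain, clipped, and self-normalized estimates is exactly the bias-variance trade-off that notebook 08 examines.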
How To Read This Track
- Work through the notebooks in order if you want the full course arc.
- Treat each notebook as a lecture plus a lab: read the discussion, inspect the code, and rerun it locally when you want to experiment.
- For AI-heavy notebooks, expect some brittleness when live model calls are enabled; that instability is part of the course material rather than something hidden from the reader.
The .ipynb sources remain in the matching folder under notebooks/lectures/.