07. Health, Education, and Policy Applications

Health, education, and public policy problems make causal inference feel different from product analytics.

The stakes are higher. Outcomes are slower. Randomization may be ethically or politically constrained. Data may come from administrative systems rather than clean product logs. Compliance is often imperfect. The final decision is rarely “ship the feature”; it may be “expand the program”, “change eligibility”, “fund the intervention”, “continue the pilot”, or “protect a subgroup from harm”.

This notebook treats applied policy evaluation as a causal decision workflow. We will study:

a health coverage lottery with imperfect take-up,
an education regression discontinuity around an eligibility cutoff,
a staggered policy rollout with event-study diagnostics,
equity and cost-effectiveness readouts,
a final decision memo that distinguishes evidence from recommendation.

Learning Goals

By the end of this notebook, you should be able to:

Translate health, education, and policy questions into explicit causal estimands.
Distinguish intention-to-treat effects from treatment-on-treated and local average treatment effects.
Use instrumental variables when randomized assignment changes treatment take-up imperfectly.
Estimate and interpret regression discontinuity effects around eligibility cutoffs.
Analyze staggered policy rollouts with fixed effects and event-study diagnostics.
Add equity, cost, and implementation constraints to causal readouts.
Write policy-facing conclusions that are honest about internal validity, external validity, and residual risk.

1. Setup

We will use pandas, numpy, scipy, statsmodels, seaborn, matplotlib, and Graphviz.

import warnings
warnings.filterwarnings("ignore")

from graphviz import Digraph
from IPython.display import Markdown, display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.special import expit
import statsmodels.formula.api as smf

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 90)
pd.set_option("display.float_format", lambda x: f"{x:,.3f}")


def regression_effect(model, term):
    estimate = model.params[term]
    se = model.bse[term]
    return pd.Series(
        {
            "estimate": estimate,
            "std_error": se,
            "ci_lower": estimate - 1.96 * se,
            "ci_upper": estimate + 1.96 * se,
            "p_value": model.pvalues[term],
        }
    )


def plot_coef_table(table, title, xlabel, reference=0, figsize=(8, 4)):
    plot_df = table.sort_values("estimate")
    fig, ax = plt.subplots(figsize=figsize)
    ax.errorbar(
        x=plot_df["estimate"],
        y=plot_df.index,
        xerr=[
            plot_df["estimate"] - plot_df["ci_lower"],
            plot_df["ci_upper"] - plot_df["estimate"],
        ],
        fmt="o",
        color="#2b8cbe",
        ecolor="#a6bddb",
        elinewidth=3,
        capsize=4,
    )
    ax.axvline(reference, color="#444444", linestyle="--", linewidth=1)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("")
    plt.tight_layout()
    return fig, ax

2. Why Policy Causal Inference Is Hard

In product experimentation, the intervention is often a feature flag. In policy work, the intervention is often a program, benefit, rule, service model, eligibility threshold, or institutional change.

Policy evaluations have recurring complications:

the people who receive services may be selected because they need them most,
program assignment may be randomized but actual take-up may not be,
outcomes can be delayed, rare, or expensive to measure,
the intervention may have spillovers through families, schools, providers, or neighborhoods,
ethical constraints may limit who can be assigned to control,
external validity matters because the decision is usually about scaling beyond the study setting.

Deaton and Cartwright (2018) argue that randomized trials should be interpreted with realistic expectations about what they do and do not guarantee. The same caution applies to quasi-experiments: a credible design helps answer a specific causal question, not every policy question.

dot = Digraph("policy_workflow", graph_attr={"rankdir": "LR"})
dot.attr("node", shape="box", style="rounded,filled", fillcolor="#f7fbff", color="#6baed6")

dot.node("Q", "Policy question\nwho, what, when, where")
dot.node("D", "Design\nlottery, IV, RD,\nDiD, rollout")
dot.node("E", "Estimand\nITT, LATE, local RD,\nATT, equity effect")
dot.node("M", "Metrics\noutcomes, cost,\naccess, harms")
dot.node("R", "Recommendation\nscale, revise,\nhold, stop")

dot.edge("Q", "D")
dot.edge("D", "E")
dot.edge("E", "M")
dot.edge("M", "R")
dot.edge("R", "Q", label="new evidence needs", style="dashed")

dot

3. A Policy Metric Stack

Health, education, and policy projects should avoid single-number thinking. The primary outcome matters, but so do access, equity, costs, implementation quality, and harms.

metric_stack = pd.DataFrame(
    [
        {
            "domain": "Health",
            "primary_metric": "Preventive care, clinical risk, utilization, quality of life",
            "equity_question": "Did access improve for the groups facing the largest barriers?",
            "cost_question": "Did the program shift cost, reduce avoidable utilization, or improve value?",
        },
        {
            "domain": "Education",
            "primary_metric": "Learning gains, attendance, graduation, persistence",
            "equity_question": "Did the intervention reduce gaps or only raise averages?",
            "cost_question": "Are gains large relative to staff time, tutoring hours, or class capacity?",
        },
        {
            "domain": "Labor and social policy",
            "primary_metric": "Employment, earnings, housing stability, benefit take-up",
            "equity_question": "Who is screened out, discouraged, sanctioned, or burdened?",
            "cost_question": "Do administrative and participant costs change the value calculation?",
        },
        {
            "domain": "Public safety and civic systems",
            "primary_metric": "Incidents, compliance, service timeliness, trust",
            "equity_question": "Are impacts different across neighborhoods or demographic groups?",
            "cost_question": "Do benefits justify enforcement, monitoring, or service delivery costs?",
        },
    ]
)

metric_stack

	domain	primary_metric	equity_question	cost_question
0	Health	Preventive care, clinical risk, utilization, q...	Did access improve for the groups facing the l...	Did the program shift cost, reduce avoidable u...
1	Education	Learning gains, attendance, graduation, persis...	Did the intervention reduce gaps or only raise...	Are gains large relative to staff time, tutori...
2	Labor and social policy	Employment, earnings, housing stability, benef...	Who is screened out, discouraged, sanctioned, ...	Do administrative and participant costs change...
3	Public safety and civic systems	Incidents, compliance, service timeliness, trust	Are impacts different across neighborhoods or ...	Do benefits justify enforcement, monitoring, o...

4. Example 1: Health Coverage Lottery With Imperfect Take-Up

Suppose a state has funds to offer a health coverage program to some eligible low-income adults. Demand exceeds available slots, so the state runs a lottery. Lottery winners are allowed to apply, but not everyone who wins completes enrollment. Some people who lose the lottery obtain coverage through other paths.

This resembles the logic of the Oregon Health Insurance Experiment, where Oregon used a lottery to allocate a limited Medicaid expansion opportunity. Baicker et al. (2013) describe how the lottery created randomized assignment, while actual Medicaid coverage required an instrumental-variables approach because not all selected individuals enrolled.

We will simulate a simplified version.

def simulate_health_lottery(seed=101, n=9000):
    rng = np.random.default_rng(seed)

    age = rng.integers(19, 65, size=n)
    female = rng.binomial(1, 0.56, size=n)
    chronic = rng.binomial(1, expit(-1.2 + 0.035 * (age - 40)), size=n)
    income_pct_fpl = np.clip(rng.normal(95, 28, size=n), 15, 180)
    baseline_visits = rng.poisson(np.exp(-0.25 + 0.55 * chronic + 0.012 * (age - 40)), size=n)
    region = rng.choice(["North", "South", "East", "West"], size=n, p=[0.25, 0.28, 0.22, 0.25])
    high_barrier = rng.binomial(1, expit(0.65 - 0.018 * income_pct_fpl + 0.25 * chronic), size=n)

    lottery = rng.binomial(1, 0.42, size=n)
    p_coverage_if_lose = expit(-3.0 + 0.18 * chronic + 0.006 * income_pct_fpl)
    p_coverage_if_win = expit(-0.35 + 0.35 * chronic - 0.010 * income_pct_fpl - 0.45 * high_barrier)
    coverage_prob = np.where(lottery == 1, p_coverage_if_win, p_coverage_if_lose)
    coverage = rng.binomial(1, coverage_prob)

    latent_health = (
        58
        - 0.18 * (age - 40)
        - 6.0 * chronic
        + 0.035 * income_pct_fpl
        - 2.0 * high_barrier
        + rng.normal(0, 4, size=n)
    )

    true_coverage_effect = 4.0 + 1.2 * chronic + 0.8 * high_barrier
    health_score = latent_health + true_coverage_effect * coverage + rng.normal(0, 5, size=n)
    preventive_visit = rng.binomial(
        1,
        expit(-1.00 + 0.95 * coverage + 0.35 * chronic - 0.20 * high_barrier),
        size=n,
    )
    outpatient_visits = rng.poisson(
        np.exp(-0.15 + 0.36 * coverage + 0.45 * chronic + 0.08 * baseline_visits),
        size=n,
    )
    emergency_visits = rng.poisson(
        np.exp(-1.45 - 0.22 * coverage + 0.65 * chronic + 0.20 * high_barrier),
        size=n,
    )
    out_of_pocket_cost = np.clip(
        rng.normal(620 + 230 * chronic + 90 * high_barrier - 280 * coverage, 190, size=n),
        0,
        None,
    )

    return pd.DataFrame(
        {
            "age": age,
            "female": female,
            "chronic": chronic,
            "income_pct_fpl": income_pct_fpl,
            "baseline_visits": baseline_visits,
            "region": region,
            "high_barrier": high_barrier,
            "lottery": lottery,
            "coverage": coverage,
            "health_score": health_score,
            "preventive_visit": preventive_visit,
            "outpatient_visits": outpatient_visits,
            "emergency_visits": emergency_visits,
            "out_of_pocket_cost": out_of_pocket_cost,
        }
    )


df_health = simulate_health_lottery()
df_health.head()

	age	female	chronic	income_pct_fpl	baseline_visits	region	lottery	coverage	health_score	preventive_visit	outpatient_visits	emergency_visits	out_of_pocket_cost
0	33	0	0	134.463	1	South	1	1	67.984	0	0	1	451.727
1	62	0	1	68.270	3	South	1	0	55.127	1	3	1	612.640
2	51	1	0	94.652	2	West	1	0	45.674	0	2	0	569.650
3	35	0	0	150.147	1	West	1	0	66.540	0	3	0	567.644
4	25	1	0	100.819	0	East	0	0	53.417	0	2	1	575.600

The lottery is randomized. Coverage is not. Coverage is affected by paperwork, eligibility verification, barriers to enrollment, and outside coverage options.

This creates three distinct estimands:

ITT: effect of winning the lottery, regardless of whether the person enrolls.
First stage: effect of winning the lottery on coverage.
LATE: effect of coverage among compliers, estimated using lottery assignment as an instrument.

The Wald estimand is:

\[ LATE = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]} \]

where \(Z\) is lottery assignment and \(D\) is actual coverage.

balance_table = (
    df_health.groupby("lottery")
    .agg(
        n=("lottery", "size"),
        age=("age", "mean"),
        chronic=("chronic", "mean"),
        income_pct_fpl=("income_pct_fpl", "mean"),
        baseline_visits=("baseline_visits", "mean"),
        high_barrier=("high_barrier", "mean"),
    )
    .rename(index={0: "Lottery control", 1: "Lottery winner"})
)
balance_table.loc["Difference"] = balance_table.loc["Lottery winner"] - balance_table.loc["Lottery control"]
balance_table

	n	age	chronic	income_pct_fpl	baseline_visits	high_barrier
lottery
Lottery control	5,210.000	41.461	0.249	95.189	0.962	0.279
Lottery winner	3,790.000	41.439	0.259	95.400	0.975	0.272
Difference	-1,420.000	-0.022	0.010	0.211	0.013	-0.008

The pre-treatment variables are balanced up to random noise. That gives the lottery assignment a causal interpretation. It does not make actual coverage as-if randomized.

coverage_by_lottery = (
    df_health.groupby("lottery")["coverage"]
    .agg(["mean", "count"])
    .rename(index={0: "Lottery control", 1: "Lottery winner"})
)
first_stage_raw = coverage_by_lottery.loc["Lottery winner", "mean"] - coverage_by_lottery.loc["Lottery control", "mean"]

coverage_by_lottery, first_stage_raw

(                 mean  count
 lottery                     
 Lottery control 0.083   5210
 Lottery winner  0.216   3790,
 np.float64(0.13286555690041985))

The first stage is large: winning the lottery substantially increases coverage probability. A weak first stage would make an IV estimate unstable and hard to interpret.

outcome_list = [
    "health_score",
    "preventive_visit",
    "outpatient_visits",
    "emergency_visits",
    "out_of_pocket_cost",
]

itt_rows = []
for outcome in outcome_list:
    model = smf.ols(
        f"{outcome} ~ lottery + age + female + chronic + income_pct_fpl + baseline_visits + high_barrier + C(region)",
        data=df_health,
    ).fit(cov_type="HC1")
    row = regression_effect(model, "lottery")
    row["outcome"] = outcome
    itt_rows.append(row)

itt_table = pd.DataFrame(itt_rows).set_index("outcome")
itt_table

	estimate	std_error	ci_lower	ci_upper	p_value
outcome
health_score	0.626	0.141	0.350	0.902	0.000
preventive_visit	0.043	0.010	0.024	0.063	0.000
outpatient_visits	0.106	0.024	0.059	0.152	0.000
emergency_visits	-0.017	0.012	-0.040	0.006	0.140
out_of_pocket_cost	-39.220	4.548	-48.135	-30.306	0.000

The ITT answers a program access question: what is the effect of being offered the chance to enroll?

For policy operations, ITT is often the most honest estimate of the program as implemented. If paperwork barriers prevent enrollment, the ITT includes that implementation reality.

wald_rows = []
for outcome in outcome_list:
    y1 = df_health.loc[df_health["lottery"] == 1, outcome].mean()
    y0 = df_health.loc[df_health["lottery"] == 0, outcome].mean()
    d1 = df_health.loc[df_health["lottery"] == 1, "coverage"].mean()
    d0 = df_health.loc[df_health["lottery"] == 0, "coverage"].mean()
    itt = y1 - y0
    first_stage = d1 - d0
    late = itt / first_stage
    wald_rows.append(
        {
            "outcome": outcome,
            "itt_raw": itt,
            "first_stage": first_stage,
            "wald_late": late,
        }
    )

wald_table = pd.DataFrame(wald_rows).set_index("outcome")
wald_table

	itt_raw	first_stage	wald_late
outcome
health_score	0.589	0.133	4.431
preventive_visit	0.045	0.133	0.335
outpatient_visits	0.113	0.133	0.851
emergency_visits	-0.016	0.133	-0.117
out_of_pocket_cost	-37.786	0.133	-284.396

The LATE answers a different question: among people whose coverage status was changed by the lottery, what was the effect of coverage?

Angrist, Imbens, and Rubin (1996) formalized this local average treatment effect interpretation for instrumental variables. Imbens (2014) reviews the IV logic and emphasizes that the LATE is a real but local estimand.

first_stage_model = smf.ols(
    "coverage ~ lottery + age + female + chronic + income_pct_fpl + baseline_visits + high_barrier + C(region)",
    data=df_health,
).fit(cov_type="HC1")

df_health = df_health.assign(coverage_hat=first_stage_model.fittedvalues)

iv_health_model = smf.ols(
    "health_score ~ coverage_hat + age + female + chronic + income_pct_fpl + baseline_visits + high_barrier + C(region)",
    data=df_health,
).fit(cov_type="HC1")

naive_health_model = smf.ols(
    "health_score ~ coverage + age + female + chronic + income_pct_fpl + baseline_visits + high_barrier + C(region)",
    data=df_health,
).fit(cov_type="HC1")

iv_comparison = pd.DataFrame(
    {
        "Naive adjusted coverage coefficient": regression_effect(naive_health_model, "coverage"),
        "Teaching 2SLS point estimate": regression_effect(iv_health_model, "coverage_hat"),
        "First stage lottery coefficient": regression_effect(first_stage_model, "lottery"),
    }
).T

iv_comparison

	estimate	std_error	ci_lower	ci_upper
Naive adjusted coverage coefficient	4.807	0.191	4.432	5.182
Teaching 2SLS point estimate	4.737	1.065	2.650	6.824
First stage lottery coefficient	0.132	0.008	0.117	0.147

The two-stage estimate here is for teaching the mechanics. In production work, use an IV estimator that computes correct two-stage standard errors and diagnostics. The conceptual message is the same:

comparing covered and uncovered people can remain confounded,
comparing lottery winners and losers gives the ITT,
scaling the ITT by the first stage gives a complier-focused effect of actual coverage.

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.barplot(
    data=df_health,
    x="lottery",
    y="coverage",
    ax=axes[0],
    color="#74c476",
)
axes[0].set_title("Lottery strongly shifts coverage")
axes[0].set_xticklabels(["Control", "Winner"])
axes[0].set_xlabel("")
axes[0].set_ylabel("Coverage rate")

sns.barplot(
    data=df_health,
    x="lottery",
    y="health_score",
    ax=axes[1],
    color="#6baed6",
)
axes[1].set_title("ITT effect on health score")
axes[1].set_xticklabels(["Control", "Winner"])
axes[1].set_xlabel("")
axes[1].set_ylabel("Average health score")

plt.tight_layout()
plt.show()

5. Equity Readout for the Lottery

Policy decisions should not stop at the average effect. If enrollment barriers are higher for some groups, the same lottery can produce unequal access.

equity_first_stage = (
    df_health.groupby(["high_barrier", "lottery"])["coverage"]
    .mean()
    .unstack()
    .rename(index={0: "Lower enrollment barrier", 1: "Higher enrollment barrier"})
)
equity_first_stage["first_stage"] = equity_first_stage[1] - equity_first_stage[0]
equity_first_stage

lottery	0	1	first_stage
high_barrier
Lower enrollment barrier	0.083	0.232	0.149
Higher enrollment barrier	0.085	0.175	0.090

equity_itt_rows = []
for group_value, group_df in df_health.groupby("high_barrier"):
    label = "Higher enrollment barrier" if group_value == 1 else "Lower enrollment barrier"
    model = smf.ols(
        "health_score ~ lottery + age + female + chronic + income_pct_fpl + baseline_visits + C(region)",
        data=group_df,
    ).fit(cov_type="HC1")
    row = regression_effect(model, "lottery")
    row["group"] = label
    equity_itt_rows.append(row)

equity_itt = pd.DataFrame(equity_itt_rows).set_index("group")
equity_itt

	estimate	std_error	ci_lower	ci_upper	p_value
group
Lower enrollment barrier	0.728	0.166	0.403	1.053	0.000
Higher enrollment barrier	0.330	0.266	-0.190	0.851	0.214

A program can be effective on average while still failing an equity goal. Here, the higher-barrier group receives a smaller first-stage boost, which means the administrative pathway weakens access precisely where support may be needed most.

6. Example 2: Education Regression Discontinuity

Now suppose a school district offers intensive tutoring to students whose diagnostic score falls below a cutoff. Students just below the cutoff are eligible; students just above are not.

Regression discontinuity uses the cutoff rule as a source of quasi-experimental variation. The identifying idea is local: students just below and just above the cutoff should be similar except for eligibility.

Lee and Lemieux (2010) describe RD as a design with transparent graphs and concrete assumptions. Wing and Bello-Gomez (2018) emphasize that RD estimates are often highly credible near the cutoff but may not generalize to the entire target population.

def simulate_tutoring_rd(seed=202, n=6500):
    rng = np.random.default_rng(seed)
    score_centered = np.clip(rng.normal(0, 18, size=n), -45, 45)
    below_cutoff = (score_centered < 0).astype(int)
    school = rng.choice([f"School {i+1:02d}" for i in range(18)], size=n)
    english_learner = rng.binomial(1, expit(-0.65 - 0.02 * score_centered), size=n)
    prior_attendance = np.clip(rng.normal(0.90 + 0.002 * score_centered - 0.04 * english_learner, 0.05, size=n), 0.60, 1.00)

    tutoring_hours = np.where(
        below_cutoff == 1,
        rng.poisson(22 + 4 * english_learner, size=n),
        rng.poisson(2 + 1 * english_learner, size=n),
    )
    treatment = (tutoring_hours >= 10).astype(int)
    school_effect = {s: rng.normal(0, 1.2) for s in np.unique(school)}

    endline_score = (
        70
        + 0.48 * score_centered
        - 0.010 * score_centered**2 / 10
        + 3.8 * treatment
        + 2.1 * english_learner * treatment
        + 7.0 * (prior_attendance - 0.90)
        + np.array([school_effect[s] for s in school])
        + rng.normal(0, 5.8, size=n)
    )

    return pd.DataFrame(
        {
            "score_centered": score_centered,
            "below_cutoff": below_cutoff,
            "tutoring_hours": tutoring_hours,
            "treatment": treatment,
            "school": school,
            "english_learner": english_learner,
            "prior_attendance": prior_attendance,
            "endline_score": endline_score,
        }
    )


df_rd = simulate_tutoring_rd()
df_rd.head()

	score_centered	below_cutoff	tutoring_hours	treatment	school	prior_attendance	endline_score
0	32.611	0	2	0	School 06	0.976	91.855
1	-13.123	1	27	1	School 14	0.880	60.940
2	-19.541	1	24	1	School 16	0.916	64.732
3	-7.234	1	22	1	School 07	0.868	63.909
4	-20.747	1	17	1	School 06	0.813	75.583

The running variable is the diagnostic score centered at the cutoff. Negative values are below the cutoff and eligible for tutoring.

First inspect the assignment rule.

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.scatterplot(
    data=df_rd.sample(1200, random_state=4),
    x="score_centered",
    y="tutoring_hours",
    hue="below_cutoff",
    palette={0: "#9ecae1", 1: "#31a354"},
    alpha=0.45,
    ax=axes[0],
    legend=False,
)
axes[0].axvline(0, color="#444444", linestyle="--")
axes[0].set_title("Eligibility cutoff changes tutoring intensity")
axes[0].set_xlabel("Diagnostic score relative to cutoff")
axes[0].set_ylabel("Tutoring hours")

sns.histplot(
    data=df_rd,
    x="score_centered",
    bins=45,
    color="#756bb1",
    ax=axes[1],
)
axes[1].axvline(0, color="#444444", linestyle="--")
axes[1].set_title("Running variable density around cutoff")
axes[1].set_xlabel("Diagnostic score relative to cutoff")

plt.tight_layout()
plt.show()

The histogram is a basic manipulation diagnostic. If families or schools can precisely manipulate the running variable, the density may jump around the cutoff. A visible jump does not prove manipulation, but it demands investigation.

df_rd = df_rd.assign(
    score_bin=pd.cut(df_rd["score_centered"], bins=np.arange(-45, 46, 3), include_lowest=True)
)

binned = (
    df_rd.groupby("score_bin")
    .agg(
        score_mid=("score_centered", "mean"),
        endline_score=("endline_score", "mean"),
        tutoring_rate=("treatment", "mean"),
        n=("endline_score", "size"),
    )
    .dropna()
    .reset_index(drop=True)
)

fig, ax = plt.subplots(figsize=(9, 5))
sns.scatterplot(
    data=binned,
    x="score_mid",
    y="endline_score",
    size="n",
    color="#2b8cbe",
    legend=False,
    ax=ax,
)
ax.axvline(0, color="#444444", linestyle="--")
ax.set_title("Binned RD plot: endline score by diagnostic score")
ax.set_xlabel("Diagnostic score relative to cutoff")
ax.set_ylabel("Average endline score")
plt.tight_layout()
plt.show()

The visual estimand is the jump in the conditional mean of the outcome at the cutoff.

For a sharp RD:

\[ \tau_{RD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x] \]

Here, treatment intensity changes sharply but not perfectly. For teaching, we will estimate the effect of eligibility using local linear regressions around the cutoff.

def rd_estimate(df, bandwidth):
    local = df.loc[df["score_centered"].abs() <= bandwidth].copy()
    model = smf.ols(
        "endline_score ~ below_cutoff + score_centered + below_cutoff:score_centered + english_learner + prior_attendance",
        data=local,
    ).fit(cov_type="HC1")
    out = regression_effect(model, "below_cutoff")
    out["bandwidth"] = bandwidth
    out["n"] = len(local)
    return out


rd_results = pd.DataFrame([rd_estimate(df_rd, bw) for bw in [6, 9, 12, 18, 24]]).set_index("bandwidth")
rd_results

	estimate	std_error	ci_lower	ci_upper	p_value	n
bandwidth
6.000	5.010	0.563	3.907	6.113	0.000	1,675.000
9.000	4.479	0.467	3.563	5.394	0.000	2,485.000
12.000	4.309	0.410	3.504	5.113	0.000	3,191.000
18.000	4.227	0.342	3.556	4.898	0.000	4,403.000
24.000	4.387	0.306	3.787	4.988	0.000	5,272.000

Bandwidth is a bias-variance choice. Narrow bandwidths compare more similar students but use fewer observations. Wider bandwidths use more data but rely more heavily on functional-form assumptions.

plot_coef_table(
    rd_results.rename(index=lambda x: f"Bandwidth {x}"),
    title="RD eligibility effect across bandwidths",
    xlabel="Eligibility effect on endline score",
    figsize=(8.5, 4.5),
)
plt.show()

The RD estimate should be described precisely:

It is local to students near the cutoff.
It estimates the effect of eligibility, not necessarily the effect of every tutoring hour.
It depends on continuity: absent the program, the outcome trend should have been smooth through the cutoff.
It is not automatically the right estimate for students far below the cutoff.

rd_equity = []
for group_value, group_df in df_rd.groupby("english_learner"):
    label = "English learner" if group_value == 1 else "Not English learner"
    row = rd_estimate(group_df, 12)
    row["group"] = label
    rd_equity.append(row)

rd_equity = pd.DataFrame(rd_equity).set_index("group")
rd_equity

	estimate	std_error	ci_lower	ci_upper	p_value	bandwidth	n
group
Not English learner	3.842	0.500	2.862	4.822	0.000	12.000	2,118.000
English learner	5.132	0.717	3.727	6.537	0.000	12.000	1,073.000

Subgroup RD estimates are often noisy because they use only observations near the cutoff. Still, they can guide implementation questions:

Does the same eligibility threshold work for all students?
Are support hours sufficient for students with additional needs?
Is the program improving average scores while leaving some groups behind?

7. Example 3: Staggered Policy Rollout

Now suppose a public health agency rolls out a community care coordination program across counties. The program is intended to reduce preventable emergency visits by connecting high-risk residents to primary care, medication support, transportation assistance, and follow-up.

Counties adopt the program in different quarters. Some counties are not treated during the observation window.

This is a difference-in-differences setting. The key assumption is that, absent treatment, treated and comparison counties would have followed parallel outcome trends. Callaway (2022) reviews DiD for policy evaluation, and Callaway and Sant’Anna (2021) discuss complications with multiple time periods and staggered adoption.

def simulate_policy_rollout(seed=303, n_counties=72, n_quarters=28):
    rng = np.random.default_rng(seed)
    counties = [f"County {i+1:02d}" for i in range(n_counties)]
    adoption_options = np.array([10, 14, 18, 999])
    adoption_probs = np.array([0.25, 0.25, 0.25, 0.25])
    adoption_quarter = dict(zip(counties, rng.choice(adoption_options, size=n_counties, p=adoption_probs)))
    deprivation = {c: rng.binomial(1, 0.45) for c in counties}
    county_fe = {c: rng.normal(0, 3.2) + 3.5 * deprivation[c] for c in counties}
    county_trend = {c: rng.normal(0.02, 0.05) for c in counties}

    rows = []
    for county in counties:
        adopt = adoption_quarter[county]
        high_deprivation = deprivation[county]
        for quarter in range(n_quarters):
            treated = int(quarter >= adopt)
            event_time = quarter - adopt if adopt < 900 else np.nan
            seasonal = 1.0 * np.sin(2 * np.pi * quarter / 4)
            secular = -0.06 * quarter
            dynamic_effect = 0
            if treated:
                dynamic_effect = -0.9 - 0.23 * min(max(event_time, 0), 6) - 0.55 * high_deprivation
            outcome = (
                34
                + county_fe[county]
                + seasonal
                + secular
                + county_trend[county] * quarter
                + dynamic_effect
                + rng.normal(0, 1.8)
            )
            program_cost = treated * (1.8 + 0.55 * high_deprivation + rng.normal(0, 0.08))
            rows.append(
                {
                    "county": county,
                    "quarter": quarter,
                    "adoption_quarter": adopt,
                    "treated": treated,
                    "event_time": event_time,
                    "high_deprivation": high_deprivation,
                    "preventable_ed_rate": outcome,
                    "program_cost_per_resident": max(program_cost, 0),
                }
            )

    return pd.DataFrame(rows)


df_policy = simulate_policy_rollout()
df_policy.head()

	county	quarter	adoption_quarter	event_time	preventable_ed_rate
0	County 01	0	10	-10.000	29.885
1	County 01	1	10	-9.000	34.583
2	County 01	2	10	-8.000	30.227
3	County 01	3	10	-7.000	31.381
4	County 01	4	10	-6.000	27.667

rollout_summary = (
    df_policy.drop_duplicates("county")
    .assign(cohort=lambda d: np.where(d["adoption_quarter"] == 999, "Never treated", d["adoption_quarter"].astype(int).astype(str)))
    .groupby(["cohort", "high_deprivation"])
    .size()
    .unstack(fill_value=0)
    .rename(columns={0: "Lower deprivation", 1: "Higher deprivation"})
)

rollout_summary

high_deprivation	Lower deprivation	Higher deprivation
cohort
10	15	5
14	10	10
18	8	7
Never treated	10	7

First estimate a simple two-way fixed effects model:

\[ Y_{ct} = \alpha_c + \lambda_t + \tau D_{ct} + \epsilon_{ct} \]

where \(c\) indexes county and \(t\) indexes quarter.

twfe_model = smf.ols(
    "preventable_ed_rate ~ treated + C(county) + C(quarter)",
    data=df_policy,
).fit(
    cov_type="cluster",
    cov_kwds={"groups": df_policy["county"]},
)

twfe_effect = regression_effect(twfe_model, "treated")
twfe_effect

estimate    -2.202
std_error    0.180
ci_lower    -2.555
ci_upper    -1.849
p_value      0.000
dtype: float64

The TWFE coefficient summarizes treated versus untreated county-quarters after county and quarter fixed effects. It is useful as a starting point, but staggered adoption can make this average hard to interpret when treatment effects vary across cohorts or over time.

The event study is the diagnostic workhorse.

event_window = list(range(-8, 9))
event_window.remove(-1)
event_cols = []

for k in event_window:
    col = f"event_m{abs(k)}" if k < 0 else f"event_p{k}"
    df_policy[col] = ((df_policy["event_time"] == k).astype(int)).fillna(0)
    event_cols.append(col)

event_formula = "preventable_ed_rate ~ " + " + ".join(event_cols) + " + C(county) + C(quarter)"
event_model = smf.ols(event_formula, data=df_policy).fit(
    cov_type="cluster",
    cov_kwds={"groups": df_policy["county"]},
)

event_results = []
for k, col in zip(event_window, event_cols):
    est = event_model.params[col]
    se = event_model.bse[col]
    event_results.append(
        {
            "event_time": k,
            "estimate": est,
            "std_error": se,
            "ci_lower": est - 1.96 * se,
            "ci_upper": est + 1.96 * se,
        }
    )

event_results = pd.DataFrame(event_results)
event_results.head()

	event_time	estimate	std_error	ci_lower	ci_upper
0	-8	0.973	0.340	0.306	1.640
1	-7	0.570	0.349	-0.114	1.253
2	-6	0.471	0.310	-0.137	1.080
3	-5	0.866	0.262	0.353	1.380
4	-4	0.926	0.345	0.250	1.603

fig, ax = plt.subplots(figsize=(9, 4.8))

ax.errorbar(
    event_results["event_time"],
    event_results["estimate"],
    yerr=[
        event_results["estimate"] - event_results["ci_lower"],
        event_results["ci_upper"] - event_results["estimate"],
    ],
    fmt="o-",
    color="#2b8cbe",
    ecolor="#a6bddb",
    capsize=4,
)
ax.axhline(0, color="#444444", linestyle="--", linewidth=1)
ax.axvline(-1, color="#777777", linestyle=":", linewidth=1)
ax.set_title("Event study: community care policy rollout")
ax.set_xlabel("Quarters relative to adoption")
ax.set_ylabel("Effect on preventable ED visits per 10,000")
plt.tight_layout()
plt.show()

A credible event study should show no strong pre-treatment differences in the periods before adoption. Post-treatment coefficients should move after, not before, the policy begins.

In this simulation the post-treatment effect becomes more negative over time, which is plausible for a coordination program that takes time to change utilization.

heterogeneity_model = smf.ols(
    "preventable_ed_rate ~ treated + treated:high_deprivation + C(county) + C(quarter)",
    data=df_policy,
).fit(
    cov_type="cluster",
    cov_kwds={"groups": df_policy["county"]},
)

heterogeneity_table = pd.DataFrame(
    {
        "Base treated effect": regression_effect(heterogeneity_model, "treated"),
        "Additional effect in high-deprivation counties": regression_effect(heterogeneity_model, "treated:high_deprivation"),
    }
).T

heterogeneity_table

	estimate	std_error	ci_lower	ci_upper	p_value
Base treated effect	-2.128	0.201	-2.521	-1.734	0.000
Additional effect in high-deprivation counties	-0.185	0.255	-0.684	0.315	0.469

The interaction term asks whether the effect differs in higher-deprivation counties. In policy work, this is not a decorative subgroup analysis. It is central to whether a program is reducing or widening inequity.

cost_effect_model = smf.ols(
    "program_cost_per_resident ~ treated + C(county) + C(quarter)",
    data=df_policy,
).fit(
    cov_type="cluster",
    cov_kwds={"groups": df_policy["county"]},
)

ed_effect = twfe_effect["estimate"]
cost_effect = regression_effect(cost_effect_model, "treated")["estimate"]
assumed_cost_per_preventable_ed_visit = 1200
residents_per_county = 100000

avoided_visits_per_county_quarter = -ed_effect * residents_per_county / 10000
gross_savings = avoided_visits_per_county_quarter * assumed_cost_per_preventable_ed_visit
program_cost = cost_effect * residents_per_county
net_savings = gross_savings - program_cost

cost_readout = pd.Series(
    {
        "ED rate effect per 10,000 residents": ed_effect,
        "Avoided visits per county-quarter": avoided_visits_per_county_quarter,
        "Gross avoided ED cost per county-quarter": gross_savings,
        "Program cost per county-quarter": program_cost,
        "Net savings per county-quarter": net_savings,
    }
)

cost_readout

ED rate effect per 10,000 residents              -2.202
Avoided visits per county-quarter                22.020
Gross avoided ED cost per county-quarter     26,424.142
Program cost per county-quarter             201,989.251
Net savings per county-quarter             -175,565.109
dtype: float64

Cost-effectiveness calculations should be transparent about assumptions. Here the dollar value depends on an assumed cost per preventable emergency visit and a stylized county size. In real policy work, the analyst would show sensitivity ranges.

cost_sensitivity = []
for ed_cost in [800, 1000, 1200, 1600, 2000]:
    gross = avoided_visits_per_county_quarter * ed_cost
    cost_sensitivity.append(
        {
            "assumed_cost_per_avoided_visit": ed_cost,
            "gross_savings": gross,
            "program_cost": program_cost,
            "net_savings": gross - program_cost,
        }
    )

cost_sensitivity = pd.DataFrame(cost_sensitivity)
cost_sensitivity

	assumed_cost_per_avoided_visit	gross_savings	program_cost	net_savings
0	800	17,616.095	201,989.251	-184,373.156
1	1000	22,020.118	201,989.251	-179,969.133
2	1200	26,424.142	201,989.251	-175,565.109
3	1600	35,232.189	201,989.251	-166,757.062
4	2000	44,040.237	201,989.251	-157,949.014

fig, ax = plt.subplots(figsize=(8, 4))
sns.lineplot(
    data=cost_sensitivity,
    x="assumed_cost_per_avoided_visit",
    y="net_savings",
    marker="o",
    color="#31a354",
    ax=ax,
)
ax.axhline(0, color="#444444", linestyle="--")
ax.set_title("Net savings sensitivity")
ax.set_xlabel("Assumed cost per avoided ED visit")
ax.set_ylabel("Net savings per county-quarter")
plt.tight_layout()
plt.show()

8. Evidence to Decision: Three Different Answers

A policy leader may ask, “Does this work?” The honest answer often has layers.

decision_layers = pd.DataFrame(
    [
        {
            "question": "Did program access matter?",
            "design": "Lottery or randomized encouragement",
            "estimand": "ITT",
            "answer_style": "Effect of offering access under real implementation.",
        },
        {
            "question": "Did actual treatment matter for those induced to participate?",
            "design": "Instrumental variables",
            "estimand": "LATE",
            "answer_style": "Effect for compliers, not necessarily everyone.",
        },
        {
            "question": "Did an eligibility threshold change outcomes?",
            "design": "Regression discontinuity",
            "estimand": "Local effect at cutoff",
            "answer_style": "Highly local answer around the threshold.",
        },
        {
            "question": "Did a rolled-out policy change outcomes over time?",
            "design": "Difference-in-differences or event study",
            "estimand": "ATT for treated units over observed rollout",
            "answer_style": "Depends on parallel trends and credible comparison groups.",
        },
        {
            "question": "Should we scale?",
            "design": "Decision analysis",
            "estimand": "Expected net value with equity and implementation risk",
            "answer_style": "Combines causal evidence, cost, feasibility, and ethics.",
        },
    ]
)

decision_layers

	question	design	estimand	answer_style
0	Did program access matter?	Lottery or randomized encouragement	ITT	Effect of offering access under real implement...
1	Did actual treatment matter for those induced ...	Instrumental variables	LATE	Effect for compliers, not necessarily everyone.
2	Did an eligibility threshold change outcomes?	Regression discontinuity	Local effect at cutoff	Highly local answer around the threshold.
3	Did a rolled-out policy change outcomes over t...	Difference-in-differences or event study	ATT for treated units over observed rollout	Depends on parallel trends and credible compar...
4	Should we scale?	Decision analysis	Expected net value with equity and implementat...	Combines causal evidence, cost, feasibility, a...

9. Policy Decision Memo

The memo should not overstate evidence. It should separate the causal estimate from the implementation recommendation.

ed_ci_lower = twfe_effect["ci_lower"]
ed_ci_upper = twfe_effect["ci_upper"]
equity_extra = heterogeneity_table.loc["Additional effect in high-deprivation counties", "estimate"]

recommendation = "Scale with staged monitoring" if (ed_effect < 0 and net_savings > 0) else "Continue pilot before scaling"

memo = f'''
### Policy Evaluation Decision Memo

**Policy:** community care coordination rollout across counties.

**Design:** staggered difference-in-differences with county and quarter fixed effects, supported by event-study diagnostics.

**Primary effect:** the policy changed preventable emergency visits by **{ed_effect:,.2f}** per 10,000 residents
with a 95% confidence interval from **{ed_ci_lower:,.2f}** to **{ed_ci_upper:,.2f}**.
Negative values indicate fewer preventable visits.

**Equity readout:** the additional effect in higher-deprivation counties was **{equity_extra:,.2f}** per 10,000 residents.
This suggests the program may deliver larger gains in counties with higher baseline need, but subgroup estimates should be monitored as rollout expands.

**Economic readout:** under the assumed avoided-visit cost of **${assumed_cost_per_preventable_ed_visit:,.0f}**, estimated net savings are
**${net_savings:,.0f}** per county-quarter.

**Recommendation:** **{recommendation}.**

**Conditions:** preserve a holdout or delayed-rollout comparison group where feasible, track implementation fidelity, report subgroup outcomes, and update the cost-effectiveness model with observed program costs.
'''

display(Markdown(memo))

Policy Evaluation Decision Memo

Policy: community care coordination rollout across counties.

Design: staggered difference-in-differences with county and quarter fixed effects, supported by event-study diagnostics.

Primary effect: the policy changed preventable emergency visits by -2.20 per 10,000 residents with a 95% confidence interval from -2.55 to -1.85. Negative values indicate fewer preventable visits.

Equity readout: the additional effect in higher-deprivation counties was -0.18 per 10,000 residents. This suggests the program may deliver larger gains in counties with higher baseline need, but subgroup estimates should be monitored as rollout expands.

Economic readout: under the assumed avoided-visit cost of \(1,200**, estimated net savings are **\)-175,565 per county-quarter.

Recommendation: Continue pilot before scaling.

Conditions: preserve a holdout or delayed-rollout comparison group where feasible, track implementation fidelity, report subgroup outcomes, and update the cost-effectiveness model with observed program costs.

10. Common Pitfalls in Health, Education, and Policy Work

Confusing access with treatment: assignment, eligibility, enrollment, attendance, and completion are different variables.
Conditioning on post-treatment variables: for example, analyzing only patients who visit a clinic after coverage changes access to clinics.
Ignoring implementation: a weak ITT may mean the program is ineffective, or it may mean the program did not reach people.
Overgeneralizing local effects: RD estimates near the cutoff do not automatically apply to people far from the cutoff.
Ignoring staggered DiD complications: with heterogeneous effects, simple TWFE can be hard to interpret.
Under-reporting harms: average gains can hide worse outcomes for a subgroup.
Treating cost as an afterthought: policy decisions need value, not just statistical significance.

11. Exercises

In the health lottery simulation, reduce enrollment among the higher-barrier group. How do ITT, first stage, and LATE change?
Add an exclusion restriction violation to the lottery simulation by letting lottery winners receive navigation help even without coverage. What happens to the IV interpretation?
In the tutoring RD simulation, create manipulation around the cutoff by moving some students just below zero. How does the histogram change?
Estimate the tutoring effect using a much wider bandwidth. Why can the result change?
In the policy rollout simulation, make early-adopting counties have improving pre-trends. What does the event study reveal?
Build a decision memo that recommends against scaling even when the primary causal effect is positive. Which guardrail or cost assumption drives the recommendation?

12. Key Takeaways

Policy evaluation starts with a decision, but it should not let the desired decision define the evidence.
Lotteries identify ITT effects cleanly; actual treatment effects require compliance logic.
Instrumental variables estimate a local effect for compliers under assumptions such as relevance, exclusion, monotonicity, and as-if random assignment.
Regression discontinuity can be highly credible near a cutoff, but it is local.
Difference-in-differences can evaluate staggered policy rollouts, but pre-trends, comparison groups, and heterogeneous effects matter.
Health, education, and policy readouts should include equity, cost, implementation fidelity, and uncertainty.

References

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444-455. https://doi.org/10.1080/01621459.1996.10476902

Baicker, K., Taubman, S. L., Allen, H. L., Bernstein, M., Gruber, J. H., Newhouse, J. P., Schneider, E. C., Wright, B. J., Zaslavsky, A. M., & Finkelstein, A. N. (2013). The Oregon experiment - effects of Medicaid on clinical outcomes. New England Journal of Medicine, 368(18), 1713-1722. https://doi.org/10.1056/NEJMsa1212321

Callaway, B. (2022). Difference-in-differences for policy evaluation. arXiv. https://doi.org/10.48550/arxiv.2203.15646

Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230. https://doi.org/10.1016/j.jeconom.2020.12.001

Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 210, 2-21. https://doi.org/10.1016/j.socscimed.2017.12.005

Imbens, G. W. (2014). Instrumental variables: An econometrician’s perspective. Statistical Science, 29(3), 323-358. https://doi.org/10.1214/14-STS480

Lee, D. S., & Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48(2), 281-355. https://doi.org/10.1257/jel.48.2.281

Wing, C., & Bello-Gomez, R. A. (2018). Regression discontinuity and beyond. American Journal of Evaluation, 39(1), 91-108. https://doi.org/10.1177/1098214017736155