DoubleML Tutorial 06: The Interactive IV Model (IIVM)

This notebook is a full tutorial on DoubleMLIIVM, the interactive instrumental-variable model for a binary treatment and a binary instrument.

The previous notebook handled binary treatments under observed-control adjustment. Here we add noncompliance and instrument-induced treatment changes. The treatment is binary, the instrument is binary, and the target is a local effect: the effect for units whose treatment status is shifted by the instrument.

The main point is as much conceptual as computational. Binary IV estimates are not ordinary ATE estimates. They are local effects tied to compliance behavior and instrument validity.

Learning Goals

By the end of this notebook, you should be able to:

  • Explain the IIVM setup for binary treatment and binary instrument data.
  • Distinguish ATE, ATT-style targets, and LATE.
  • Explain compliers, always-takers, never-takers, and defiers.
  • State the IV assumptions for binary instruments: relevance, exclusion, independence after controls, and monotonicity.
  • Understand the DoubleMLIIVM nuisance roles: ml_g, ml_m, and ml_r.
  • Manually compute the cross-fitted IIVM ratio score.
  • Fit DoubleMLIIVM and diagnose first-stage strength, instrument overlap, nuisance losses, score contributions, and split stability.
  • Explain why weak instruments, invalid instruments, or wrong subgroup assumptions change the meaning of the estimate.

From ATE To LATE

For a binary treatment, the ATE is:

\[ E[Y(1)-Y(0)]. \]

That target averages treatment effects over everyone. In a binary IV design with noncompliance, the instrument does not necessarily move everyone into treatment. Some units take treatment regardless of the instrument, some never take treatment, and some take treatment only when encouraged.

The local average treatment effect is the treatment effect for the instrument-responsive units:

\[ \text{LATE} = E[Y(1)-Y(0) \mid \text{complier}]. \]

When compliance probability varies with covariates, the model target is a compliance-weighted local effect. In the synthetic data we can compute that target exactly; in real data we usually cannot observe compliance type directly.

Compliance Types

With a binary instrument Z and binary treatment D, define two potential treatment states:

\[ D(1) \quad \text{and} \quad D(0). \]

D(1) is the treatment the unit would take if assigned the instrument (Z=1); D(0) is the treatment it would take if not (Z=0).

The common compliance types are:

  • Complier: D(1)=1, D(0)=0.
  • Always-taker: D(1)=1, D(0)=1.
  • Never-taker: D(1)=0, D(0)=0.
  • Defier: D(1)=0, D(0)=1.

The usual monotonicity assumption rules out defiers: the instrument should not make anyone less likely to take treatment.
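
The sketch below labels the four types from the two potential treatment states. The arrays are hypothetical: real data reveals only D(Z) for the realized Z, never both D(1) and D(0) for the same unit, which is exactly why compliance type is unobservable.

import numpy as np

# A minimal sketch (hypothetical arrays): label compliance types from the
# two potential treatment states.
d1 = np.array([1, 1, 0, 0])  # D(1): treatment if encouraged
d0 = np.array([0, 1, 0, 1])  # D(0): treatment if not encouraged

labels = np.select(
    [(d1 == 1) & (d0 == 0), (d1 == 1) & (d0 == 1), (d1 == 0) & (d0 == 0)],
    ["complier", "always_taker", "never_taker"],
    default="defier",  # the type that monotonicity rules out
)
print(labels)  # ['complier' 'always_taker' 'never_taker' 'defier']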

IIVM Assumptions

A binary IV design needs several assumptions:

  • Relevance: the instrument changes treatment probability after adjusting for controls.
  • Exclusion: the instrument affects the outcome only through treatment.
  • Conditional independence: after adjusting for controls, the instrument is independent of potential outcomes and potential treatment states.
  • Monotonicity: the instrument does not push any unit's treatment in the opposite direction; there are no defiers.
  • Overlap for the instrument: every relevant covariate profile has a nonzero chance of both instrument states.

The data can help screen relevance and overlap. Exclusion, independence, and monotonicity require design evidence.
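
To see why these assumptions deliver a complier effect, take the simplest case of an unconditionally randomized instrument. Exclusion lets us write the observed outcome as Y = Y(D(Z)), and independence plus a short expansion give

\[ E[Y \mid Z=1] - E[Y \mid Z=0] = E[(Y(1)-Y(0))(D(1)-D(0))]. \]

Monotonicity makes D(1)-D(0) an indicator for compliers, so the right-hand side equals the complier effect times the complier share, while

\[ E[D \mid Z=1] - E[D \mid Z=0] = P(\text{complier}). \]

Dividing the two differences yields the LATE. The covariate-adjusted model below replaces these unconditional means with conditional nuisance functions.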

IIVM Nuisance Functions

DoubleMLIIVM estimates these nuisance components:

\[ g_0(0, X) = E[Y \mid Z=0, X] \]

\[ g_0(1, X) = E[Y \mid Z=1, X] \]

\[ m_0(X) = P(Z=1 \mid X) \]

and

\[ r_0(0, X) = P(D=1 \mid Z=0, X), \quad r_0(1, X) = P(D=1 \mid Z=1, X). \]

In the API:

  • ml_g predicts the outcome as a function of instrument status and controls.
  • ml_m predicts the instrument propensity.
  • ml_r predicts treatment take-up as a function of instrument status and controls.

The target is the ratio:

\[ \theta_0 = \frac{E[g_0(1,X)-g_0(0,X)]}{E[r_0(1,X)-r_0(0,X)]}. \]
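As a quick numerical illustration, the sketch below evaluates this ratio on synthetic oracle arrays. Every name here is hypothetical and local to the sketch; it simply confirms that the ratio is a compliance-weighted average of the conditional effect.

import numpy as np

# Plug-in sketch of the ratio target on synthetic oracle arrays.
rng = np.random.default_rng(0)
n = 100_000
r0 = rng.uniform(0.10, 0.30, n)       # P(D=1 | Z=0, X): always-takers
r1 = r0 + rng.uniform(0.30, 0.50, n)  # P(D=1 | Z=1, X): adds compliers
tau = 1.0 + 0.5 * rng.normal(size=n)  # conditional treatment effect
g0 = rng.normal(size=n)               # E[Y | Z=0, X]
g1 = g0 + tau * (r1 - r0)             # instrument moves Y only through D

theta = np.mean(g1 - g0) / np.mean(r1 - r0)
late = np.mean(tau * (r1 - r0)) / np.mean(r1 - r0)
print(theta, late)  # identical by construction: a compliance-weighted effect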

Runtime Note

This notebook fits several cross-fitted binary IV models, a manual IIVM estimator, subgroup comparisons, and a repeated-splitting check. On a typical laptop, the full run should take roughly two to five minutes.

The stress tests use synthetic oracle calculations where possible so that the conceptual lessons stay fast.

Setup

This cell prepares the notebook environment. It creates output folders, keeps matplotlib cache files inside the tutorial output folder, imports scientific Python libraries, and records package versions.

The path logic supports running from the repository root or directly from the tutorial folder.

from pathlib import Path
import os
import warnings

NOTEBOOK_PREFIX = "06"
RANDOM_STATE = 2026

CURRENT_DIR = Path.cwd().resolve()
if CURRENT_DIR.name == "doubleml":
    # Running inside notebooks/tutorials/doubleml: climb back to the repo root.
    REPO_ROOT = CURRENT_DIR.parents[2]
else:
    # Running from the repo root or any other directory: use it directly.
    REPO_ROOT = CURRENT_DIR

OUTPUT_DIR = REPO_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message=".*X does not have valid feature names.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegressionCV
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score, r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

import doubleml as dml
from doubleml import DoubleMLData, DoubleMLIIVM, DoubleMLIRM

warnings.filterwarnings("ignore", category=ConvergenceWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["axes.titleweight"] = "bold"

package_versions = pd.DataFrame(
    [
        {"package": "doubleml", "version": dml.__version__},
        {"package": "numpy", "version": np.__version__},
        {"package": "pandas", "version": pd.__version__},
        {"package": "seaborn", "version": sns.__version__},
        {"package": "statsmodels", "version": sm.__version__},
    ]
)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 doubleml 0.11.2
1 numpy 2.4.4
2 pandas 3.0.2
3 seaborn 0.13.2
4 statsmodels 0.14.6

The package table is saved as part of the reproducibility record. Binary IV workflows depend on classifier behavior, cross-fitting splits, and package defaults.

Helper Functions

The next cell defines utilities for saving tables, computing simple baselines, computing manual ratio scores, cross-fitting IIVM nuisance functions, and extracting DoubleML diagnostics.

The manual IIVM formula is the most important helper. It estimates a numerator effect of the instrument on the outcome and a denominator effect of the instrument on treatment, then takes their ratio.

def save_table(df, name):
    path = TABLE_DIR / f"{NOTEBOOK_PREFIX}_{name}.csv"
    df.to_csv(path, index=False)
    return df


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def mean_difference_summary(y, group, label, target="difference"):
    group = np.asarray(group)
    y = np.asarray(y)
    high = y[group == 1]
    low = y[group == 0]
    theta_hat = float(high.mean() - low.mean())
    std_error = float(np.sqrt(high.var(ddof=1) / len(high) + low.var(ddof=1) / len(low)))
    return {
        "estimator": label,
        "target": target,
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
    }


def treatment_ols_summary(y, X, treatment_col, label):
    X_design = sm.add_constant(X, has_constant="add")
    fit = sm.OLS(y, X_design).fit(cov_type="HC1")
    row = fit.summary2().tables[1].loc[treatment_col]
    return {
        "estimator": label,
        "target": "observed-control treatment slope",
        "theta_hat": float(row["Coef."]),
        "std_error": float(row["Std.Err."]),
        "ci_95_lower": float(row["[0.025"]),
        "ci_95_upper": float(row["0.975]"]),
        "p_value": float(row["P>|z|"]),
    }


def wald_summary(y, d, z, label):
    y = np.asarray(y)
    d = np.asarray(d)
    z = np.asarray(z)
    numerator = y[z == 1].mean() - y[z == 0].mean()
    denominator = d[z == 1].mean() - d[z == 0].mean()
    theta_hat = float(numerator / denominator)
    influence = (z / z.mean()) * (y - theta_hat * d) - ((1 - z) / (1 - z).mean()) * (y - theta_hat * d)
    std_error = float(np.std(influence, ddof=1) / (abs(denominator) * np.sqrt(len(y))))
    return {
        "estimator": label,
        "target": "raw Wald LATE-style ratio",
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
        "wald_numerator": float(numerator),
        "wald_denominator": float(denominator),
    }


def residualized_wald_summary(y, d, z, X, label):
    y_resid = y - LinearRegression().fit(X, y).predict(X)
    d_resid = d - LinearRegression().fit(X, d).predict(X)
    z_resid = z - LinearRegression().fit(X, z).predict(X)
    numerator = float(np.mean(z_resid * y_resid))
    denominator = float(np.mean(z_resid * d_resid))
    theta_hat = numerator / denominator
    psi = z_resid * (y_resid - theta_hat * d_resid)
    std_error = float(np.sqrt(np.mean(psi ** 2) / (denominator ** 2 * len(y))))
    return {
        "estimator": label,
        "target": "linear residualized Wald ratio",
        "theta_hat": float(theta_hat),
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
        "wald_numerator": numerator,
        "wald_denominator": denominator,
    }


def iivm_ratio_summary(y, d, z, g0_hat, g1_hat, m_hat, r0_hat, r1_hat, label):
    y = np.asarray(y)
    d = np.asarray(d)
    z = np.asarray(z)
    m_hat = np.clip(np.asarray(m_hat), 0.02, 0.98)
    g0_hat = np.asarray(g0_hat)
    g1_hat = np.asarray(g1_hat)
    r0_hat = np.clip(np.asarray(r0_hat), 0.01, 0.99)
    r1_hat = np.clip(np.asarray(r1_hat), 0.01, 0.99)

    numerator_score = g1_hat - g0_hat + z * (y - g1_hat) / m_hat - (1 - z) * (y - g0_hat) / (1 - m_hat)
    denominator_score = r1_hat - r0_hat + z * (d - r1_hat) / m_hat - (1 - z) * (d - r0_hat) / (1 - m_hat)
    numerator = float(np.mean(numerator_score))
    denominator = float(np.mean(denominator_score))
    theta_hat = numerator / denominator
    ratio_score = numerator_score - theta_hat * denominator_score
    std_error = float(np.std(ratio_score, ddof=1) / (abs(denominator) * np.sqrt(len(y))))
    return {
        "estimator": label,
        "target": "LATE",
        "theta_hat": float(theta_hat),
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
        "iivm_numerator": numerator,
        "iivm_denominator": denominator,
    }


def crossfit_iivm_nuisances(X, y, d, z, outcome_learner, instrument_learner, treatment_learner, n_splits=5, random_state=RANDOM_STATE):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    g0_hat = np.zeros(len(y))
    g1_hat = np.zeros(len(y))
    m_hat = np.zeros(len(y))
    r0_hat = np.zeros(len(y))
    r1_hat = np.zeros(len(y))

    for train_idx, test_idx in cv.split(X):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        y_train = y[train_idx]
        d_train = d[train_idx]
        z_train = z[train_idx]

        g0_model = clone(outcome_learner).fit(X_train[z_train == 0], y_train[z_train == 0])
        g1_model = clone(outcome_learner).fit(X_train[z_train == 1], y_train[z_train == 1])
        m_model = clone(instrument_learner).fit(X_train, z_train)
        r0_model = clone(treatment_learner).fit(X_train[z_train == 0], d_train[z_train == 0])
        r1_model = clone(treatment_learner).fit(X_train[z_train == 1], d_train[z_train == 1])

        g0_hat[test_idx] = g0_model.predict(X_test)
        g1_hat[test_idx] = g1_model.predict(X_test)
        m_hat[test_idx] = m_model.predict_proba(X_test)[:, 1]
        r0_hat[test_idx] = r0_model.predict_proba(X_test)[:, 1]
        r1_hat[test_idx] = r1_model.predict_proba(X_test)[:, 1]

    return g0_hat, g1_hat, np.clip(m_hat, 0.02, 0.98), np.clip(r0_hat, 0.01, 0.99), np.clip(r1_hat, 0.01, 0.99)


def prediction_vector(doubleml_model, learner_key):
    arr = np.asarray(doubleml_model.predictions[learner_key])
    if arr.ndim != 3:
        raise ValueError(f"Expected a 3D prediction array, got shape {arr.shape}")
    return arr[:, 0, 0]


def learner_loss_table(doubleml_model, model_label):
    losses = doubleml_model.evaluate_learners()
    rows = []
    for learner_name, values in losses.items():
        arr = np.asarray(values)
        rows.append(
            {
                "model": model_label,
                "learner": learner_name,
                "mean_loss": float(np.mean(arr)),
                "min_loss": float(np.min(arr)),
                "max_loss": float(np.max(arr)),
            }
        )
    return pd.DataFrame(rows)

The manual ratio helper mirrors the IIVM target: an instrument-outcome effect divided by an instrument-treatment effect. The denominator is the first stage. If that denominator is small, the estimate becomes unstable.
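
A tiny illustration of that instability: the same numerator noise produces a far wider spread of ratios as the first stage shrinks.

import numpy as np

# Identical noise in the numerator, divided by first stages of varying size.
rng = np.random.default_rng(0)
numerator_draws = 0.50 + rng.normal(0.0, 0.02, size=5)
for first_stage in (0.50, 0.10, 0.02):
    print(first_stage, np.round(numerator_draws / first_stage, 2))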

Draw The Binary IV Design

The diagram below shows the core binary IV story.

The instrument Z shifts treatment D. Treatment affects outcome Y. Observed controls X may affect all three. Hidden factors can affect treatment and outcome, which is why observed-control IRM may fail. The IV design requires that the hidden factors do not also drive the instrument after conditioning on X, and that the instrument has no direct outcome path.

from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

nodes = {
    "X": {"xy": (0.12, 0.58), "label": "Observed controls\nX", "color": "#dbeafe"},
    "Z": {"xy": (0.36, 0.28), "label": "Binary instrument\nZ", "color": "#dcfce7"},
    "D": {"xy": (0.58, 0.52), "label": "Binary treatment\nD", "color": "#fef3c7"},
    "Y": {"xy": (0.84, 0.52), "label": "Outcome\nY", "color": "#fee2e2"},
    "U": {"xy": (0.58, 0.84), "label": "Hidden factor\nU", "color": "#f3f4f6"},
}

edge_specs = [
    ("X", "Z", "#334155", "solid", 0.00),
    ("X", "D", "#334155", "solid", 0.04),
    ("X", "Y", "#334155", "solid", -0.08),
    ("Z", "D", "#15803d", "solid", 0.00),
    ("D", "Y", "#b45309", "solid", 0.00),
    ("U", "D", "#6b7280", "dashed", 0.00),
    ("U", "Y", "#6b7280", "dashed", -0.10),
]

fig, ax = plt.subplots(figsize=(12, 6.2))
ax.set_axis_off()
box_w, box_h = 0.15, 0.105

def edge_endpoint(source_xy, target_xy, from_source=True):
    x0, y0 = source_xy
    x1, y1 = target_xy
    dx, dy = x1 - x0, y1 - y0
    scale = 1.0 / max(abs(dx) / (box_w / 2), abs(dy) / (box_h / 2))
    if from_source:
        return (x0 + dx * scale * 1.08, y0 + dy * scale * 1.08)
    return (x1 - dx * scale * 1.12, y1 - dy * scale * 1.12)

for spec in nodes.values():
    x, y_pos = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y_pos - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y_pos, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

for start, end, color, style, rad in edge_specs:
    start_xy = edge_endpoint(nodes[start]["xy"], nodes[end]["xy"], from_source=True)
    end_xy = edge_endpoint(nodes[start]["xy"], nodes[end]["xy"], from_source=False)
    arrow = FancyArrowPatch(
        start_xy,
        end_xy,
        arrowstyle="-|>",
        mutation_scale=20,
        linewidth=1.8,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        zorder=5,
    )
    ax.add_patch(arrow)

ax.text(
    0.50,
    0.08,
    "IIVM estimates the treatment effect for units whose treatment status is changed by the binary instrument.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Interactive IV Model Teaching Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_iivm_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The diagram makes clear why this is not the same as the IRM notebook. The instrument creates a local source of treatment variation, and the estimate is tied to that variation.

Create A Teaching Dataset With Noncompliance

We now simulate a binary instrument and a binary treatment.

The instrument is an encouragement. Some units always take treatment, some never take treatment, and some comply with the encouragement. There are no defiers, so monotonicity holds by construction.

The outcome depends on the actual treatment, observed controls, and a latent factor related to compliance type. This makes direct treatment comparisons biased, but the instrument remains valid after conditioning on observed controls.

rng = np.random.default_rng(RANDOM_STATE)
N_OBS = 5_000

feature_cols = [
    "engagement_score",
    "need_intensity",
    "content_fit",
    "recent_activity",
    "price_sensitivity",
    "tenure_signal",
    "novelty_appetite",
    "seasonality_signal",
]

X_raw = pd.DataFrame(rng.normal(size=(N_OBS, len(feature_cols))), columns=feature_cols)

true_instrument_logit = (
    -0.10
    + 0.55 * X_raw["engagement_score"]
    - 0.35 * X_raw["price_sensitivity"]
    + 0.25 * np.sin(X_raw["content_fit"])
    + 0.20 * X_raw["seasonality_signal"]
)
true_instrument_propensity = np.clip(sigmoid(true_instrument_logit), 0.08, 0.92)
encouragement = rng.binomial(1, true_instrument_propensity, size=N_OBS)

p_complier = (
    0.45
    + 0.15 * np.tanh(X_raw["need_intensity"])
    + 0.10 * np.tanh(X_raw["novelty_appetite"])
    - 0.08 * np.tanh(X_raw["price_sensitivity"])
)
p_always = 0.18 + 0.07 * np.tanh(X_raw["engagement_score"])
p_complier = np.clip(p_complier, 0.15, 0.70)
p_always = np.clip(p_always, 0.05, 0.35)
p_never = np.clip(1 - p_complier - p_always, 0.05, 0.80)
prob_sum = p_complier + p_always + p_never
p_complier = p_complier / prob_sum
p_always = p_always / prob_sum
p_never = p_never / prob_sum

compliance_draw = rng.uniform(size=N_OBS)
compliance_type = np.where(
    compliance_draw < p_always,
    "always_taker",
    np.where(compliance_draw < p_always + p_complier, "complier", "never_taker"),
)

feature_exposure = np.where(
    compliance_type == "always_taker",
    1,
    np.where(compliance_type == "never_taker", 0, encouragement),
)

latent_compliance_factor = (
    np.where(compliance_type == "always_taker", 0.50, np.where(compliance_type == "never_taker", -0.20, 0.00))
    + rng.normal(loc=0.0, scale=0.40, size=N_OBS)
)

true_mu0 = (
    0.70 * np.sin(X_raw["engagement_score"])
    + 0.45 * X_raw["need_intensity"] ** 2
    - 0.35 * X_raw["price_sensitivity"]
    + 0.25 * X_raw["content_fit"] * X_raw["recent_activity"]
    + 0.35 * latent_compliance_factor
)
true_tau = (
    0.80
    + 0.25 * sigmoid(X_raw["engagement_score"])
    + 0.25 * X_raw["need_intensity"]
    - 0.15 * X_raw["price_sensitivity"]
    + 0.20 * (X_raw["novelty_appetite"] > 0).astype(float)
)
true_mu1 = true_mu0 + true_tau
outcome_noise = rng.normal(loc=0.0, scale=1.0, size=N_OBS)
weekly_value = true_mu0 + true_tau * feature_exposure + outcome_noise

# Population-level IIVM targets under the synthetic design.
TRUE_ATE = float(np.mean(true_tau))
TRUE_REALIZED_COMPLIER_EFFECT = float(np.mean(true_tau[compliance_type == "complier"]))
TRUE_LATE = float(np.mean(p_complier * true_tau) / np.mean(p_complier))
TRUE_FIRST_STAGE = float(np.mean(p_complier))

# Oracle conditional mean functions for the IIVM ratio.
true_r0 = p_always
true_r1 = p_always + p_complier
true_g0 = true_mu0 + true_tau * true_r0
true_g1 = true_mu0 + true_tau * true_r1

# Potential treatment states are useful for teaching compliance types.
potential_d0 = np.where(compliance_type == "always_taker", 1, 0)
potential_d1 = np.where(compliance_type == "never_taker", 0, 1)

iivm_df = X_raw.assign(
    encouragement=encouragement,
    feature_exposure=feature_exposure,
    weekly_value=weekly_value,
    compliance_type=compliance_type,
    true_instrument_propensity=true_instrument_propensity,
    true_p_complier=p_complier,
    true_p_always=p_always,
    true_p_never=p_never,
    true_r0=true_r0,
    true_r1=true_r1,
    true_g0=true_g0,
    true_g1=true_g1,
    true_mu0=true_mu0,
    true_mu1=true_mu1,
    true_tau=true_tau,
    latent_compliance_factor=latent_compliance_factor,
    potential_d0=potential_d0,
    potential_d1=potential_d1,
)

iivm_df.to_csv(DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_iivm_data.csv", index=False)
print(f"Saved synthetic IIVM data with shape {iivm_df.shape}")
print(f"Instrument rate: {iivm_df['encouragement'].mean():.3f}")
print(f"Treatment rate: {iivm_df['feature_exposure'].mean():.3f}")
print(f"True ATE: {TRUE_ATE:.3f}")
print(f"True compliance-weighted LATE: {TRUE_LATE:.3f}")
print(f"True realized complier effect: {TRUE_REALIZED_COMPLIER_EFFECT:.3f}")
print(f"True first stage: {TRUE_FIRST_STAGE:.3f}")
display(iivm_df.head())
Saved synthetic IIVM data with shape (5000, 26)
Instrument rate: 0.463
Treatment rate: 0.392
True ATE: 1.021
True compliance-weighted LATE: 1.102
True realized complier effect: 1.098
True first stage: 0.448
engagement_score need_intensity content_fit recent_activity price_sensitivity tenure_signal novelty_appetite seasonality_signal encouragement feature_exposure ... true_r0 true_r1 true_g0 true_g1 true_mu0 true_mu1 true_tau latent_compliance_factor potential_d0 potential_d1
0 -0.793122 0.240571 -1.896326 1.395772 0.638295 -0.292047 -0.311949 0.303835 0 0 ... 0.133788 0.543873 -1.272362 -0.926958 -1.385047 -0.542774 0.842273 -0.077700 0 1
1 -0.267660 -0.225909 0.720068 0.514705 -0.064128 -0.085477 0.160916 -0.614018 0 0 ... 0.161699 0.599455 0.076353 0.541036 -0.095292 0.966220 1.061512 -0.137787 0 0
2 -0.403750 0.548260 -0.130483 -1.374426 -0.477279 0.656622 -0.232283 -0.148733 0 0 ... 0.153179 0.690767 0.238566 0.834622 0.068727 1.177487 1.108760 -0.009748 0 0
3 0.641837 1.824610 -0.713189 1.348207 -1.230013 0.174978 -1.169530 1.351458 0 0 ... 0.219630 0.797019 2.368826 3.295216 2.016441 3.620887 1.604447 -0.259689 0 1
4 0.833923 1.137717 -0.885533 0.684555 -0.519013 -0.457385 0.506537 0.876718 1 1 ... 0.227780 0.884706 1.558963 2.568380 1.208961 2.745538 1.536577 0.222783 0 1

5 rows × 26 columns

The true ATE and true local effect differ because the instrument-responsive units have a different treatment-effect mix than the full population. That is the key conceptual difference between IRM and IIVM.

Field Dictionary

This table documents the roles of the important columns. The hidden teaching columns are not available in real applications; they are present only so the tutorial can compare estimates with known synthetic truth.

field_dictionary = pd.DataFrame(
    [
        {"column": "weekly_value", "role": "outcome", "description": "Observed outcome Y."},
        {"column": "feature_exposure", "role": "binary treatment", "description": "Observed binary treatment D."},
        {"column": "encouragement", "role": "binary instrument", "description": "Observed binary instrument Z."},
        *[
            {"column": col, "role": "control", "description": "Observed pre-treatment control X."}
            for col in feature_cols
        ],
        {"column": "compliance_type", "role": "hidden teaching column", "description": "Realized synthetic compliance type; not observed in ordinary IV data."},
        {"column": "true_instrument_propensity", "role": "hidden teaching column", "description": "True P(Z=1 | X)."},
        {"column": "true_p_complier", "role": "hidden teaching column", "description": "True conditional probability of complier type."},
        {"column": "true_p_always", "role": "hidden teaching column", "description": "True conditional probability of always-taker type."},
        {"column": "true_p_never", "role": "hidden teaching column", "description": "True conditional probability of never-taker type."},
        {"column": "true_r0", "role": "hidden teaching column", "description": "True P(D=1 | Z=0, X)."},
        {"column": "true_r1", "role": "hidden teaching column", "description": "True P(D=1 | Z=1, X)."},
        {"column": "true_g0", "role": "hidden teaching column", "description": "True E[Y | Z=0, X]."},
        {"column": "true_g1", "role": "hidden teaching column", "description": "True E[Y | Z=1, X]."},
        {"column": "true_tau", "role": "hidden teaching column", "description": "True individual treatment effect."},
        {"column": "latent_compliance_factor", "role": "hidden confounding risk", "description": "Latent factor related to treatment and outcome; handled by IV design in the simulation."},
    ]
)
save_table(field_dictionary, "field_dictionary")
display(field_dictionary)
column role description
0 weekly_value outcome Observed outcome Y.
1 feature_exposure binary treatment Observed binary treatment D.
2 encouragement binary instrument Observed binary instrument Z.
3 engagement_score control Observed pre-treatment control X.
4 need_intensity control Observed pre-treatment control X.
5 content_fit control Observed pre-treatment control X.
6 recent_activity control Observed pre-treatment control X.
7 price_sensitivity control Observed pre-treatment control X.
8 tenure_signal control Observed pre-treatment control X.
9 novelty_appetite control Observed pre-treatment control X.
10 seasonality_signal control Observed pre-treatment control X.
11 compliance_type hidden teaching column Realized synthetic compliance type; not observ...
12 true_instrument_propensity hidden teaching column True P(Z=1 | X).
13 true_p_complier hidden teaching column True conditional probability of complier type.
14 true_p_always hidden teaching column True conditional probability of always-taker t...
15 true_p_never hidden teaching column True conditional probability of never-taker type.
16 true_r0 hidden teaching column True P(D=1 | Z=0, X).
17 true_r1 hidden teaching column True P(D=1 | Z=1, X).
18 true_g0 hidden teaching column True E[Y | Z=0, X].
19 true_g1 hidden teaching column True E[Y | Z=1, X].
20 true_tau hidden teaching column True individual treatment effect.
21 latent_compliance_factor hidden confounding risk Latent factor related to treatment and outcome...

The instrument belongs in z_cols, not in ordinary controls. The compliance type is hidden and should not be used as a model feature.

Basic Data Audit

Before modeling, we check treatment support, instrument support, missingness, and the known synthetic targets.

Binary IV needs variation in both Z and D. It also needs the instrument to move treatment, not merely exist as a column.
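
Before the fuller audit, a short screen (a sketch reusing the columns above) checks both requirements at once: populated instrument arms and a take-up gap between them.

# Quick relevance/overlap screen: both instrument arms must be populated,
# and treatment take-up must differ across them.
takeup_screen = pd.crosstab(
    iivm_df["encouragement"], iivm_df["feature_exposure"], normalize="index"
)
print(takeup_screen)
raw_first_stage = (
    iivm_df.loc[iivm_df["encouragement"] == 1, "feature_exposure"].mean()
    - iivm_df.loc[iivm_df["encouragement"] == 0, "feature_exposure"].mean()
)
print(f"Raw first stage: {raw_first_stage:.3f}")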

audit_cols = ["weekly_value", "feature_exposure", "encouragement", "true_instrument_propensity", "true_p_complier", "true_tau"] + feature_cols

data_audit = pd.DataFrame(
    [
        {
            "column": column,
            "missing_rate": float(iivm_df[column].isna().mean()),
            "mean": float(iivm_df[column].mean()),
            "std": float(iivm_df[column].std()),
            "min": float(iivm_df[column].min()),
            "p05": float(iivm_df[column].quantile(0.05)),
            "median": float(iivm_df[column].median()),
            "p95": float(iivm_df[column].quantile(0.95)),
            "max": float(iivm_df[column].max()),
            "n_unique": int(iivm_df[column].nunique()),
        }
        for column in audit_cols
    ]
)
save_table(data_audit, "data_audit")
display(data_audit)

target_summary = pd.DataFrame(
    [
        {"target": "Instrument rate", "value": float(iivm_df["encouragement"].mean())},
        {"target": "Treatment rate", "value": float(iivm_df["feature_exposure"].mean())},
        {"target": "True ATE", "value": TRUE_ATE},
        {"target": "True compliance-weighted LATE", "value": TRUE_LATE},
        {"target": "True realized complier effect", "value": TRUE_REALIZED_COMPLIER_EFFECT},
        {"target": "True first stage", "value": TRUE_FIRST_STAGE},
    ]
)
save_table(target_summary, "target_summary")
display(target_summary)
column missing_rate mean std min p05 median p95 max n_unique
0 weekly_value 0.0 0.890766 1.571602 -3.906167 -1.515032 0.790934 3.637754 8.031673 5000
1 feature_exposure 0.0 0.392400 0.488334 0.000000 0.000000 0.000000 1.000000 1.000000 2
2 encouragement 0.0 0.462600 0.498649 0.000000 0.000000 0.000000 1.000000 1.000000 2
3 true_instrument_propensity 0.0 0.478380 0.157356 0.080000 0.218420 0.475840 0.741112 0.889678 5000
4 true_p_complier 0.0 0.447774 0.125234 0.150000 0.240254 0.449046 0.654633 0.700000 4903
5 true_tau 0.0 1.020535 0.315500 -0.176119 0.507569 1.024176 1.535050 2.054797 5000
6 engagement_score 0.0 -0.008521 1.006351 -3.659392 -1.686316 -0.007879 1.622752 4.013439 5000
7 need_intensity 0.0 -0.009919 0.996068 -3.635159 -1.614872 -0.007694 1.645762 3.527216 5000
8 content_fit 0.0 0.008053 0.993164 -3.987716 -1.627572 0.010408 1.622390 3.600337 5000
9 recent_activity 0.0 -0.008282 0.993804 -3.573864 -1.605342 -0.000571 1.648631 3.653308 5000
10 price_sensitivity 0.0 -0.010295 1.003277 -3.747414 -1.678895 0.002322 1.612166 3.566084 5000
11 tenure_signal 0.0 0.029898 1.001422 -4.012688 -1.634641 0.039708 1.693146 3.667511 5000
12 novelty_appetite 0.0 -0.003217 0.981203 -4.044650 -1.572431 -0.039732 1.643580 3.269031 5000
13 seasonality_signal 0.0 0.011097 1.008479 -3.372559 -1.618357 0.015585 1.678134 3.620951 5000
target value
0 Instrument rate 0.462600
1 Treatment rate 0.392400
2 True ATE 1.020535
3 True compliance-weighted LATE 1.101928
4 True realized complier effect 1.097708
5 True first stage 0.447774

The instrument and treatment both have adequate support. The true first stage is large enough for a stable teaching example.

Compliance Type Mix

The next table summarizes the synthetic compliance types. In real data, we generally cannot label individuals this way because we observe only one instrument state per unit.

The table is useful for intuition: only compliers identify the local treatment effect under the standard monotonic IV story.
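
The self-contained toy check below makes that concrete. Its instrument is unconditionally randomized, unlike this notebook's Z, so the raw Wald ratio recovers the complier mean effect rather than the population average.

import numpy as np

# Toy check of the LATE identity under monotonicity with a randomized
# instrument: no covariate adjustment is needed here.
rng = np.random.default_rng(0)
n = 200_000
ctype = rng.choice(["complier", "always_taker", "never_taker"], size=n, p=[0.5, 0.2, 0.3])
z_toy = rng.integers(0, 2, size=n)
d_toy = np.where(ctype == "always_taker", 1, np.where(ctype == "never_taker", 0, z_toy))
tau_toy = np.where(ctype == "complier", 1.5, 0.9) + rng.normal(0.0, 0.3, size=n)
y_toy = rng.normal(size=n) + tau_toy * d_toy

wald = (y_toy[z_toy == 1].mean() - y_toy[z_toy == 0].mean()) / (
    d_toy[z_toy == 1].mean() - d_toy[z_toy == 0].mean()
)
print(f"Wald ratio: {wald:.3f}")
print(f"Complier mean effect: {tau_toy[ctype == 'complier'].mean():.3f}")
print(f"Population mean effect: {tau_toy.mean():.3f}")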

compliance_summary = (
    iivm_df.groupby("compliance_type")
    .agg(
        n=("weekly_value", "size"),
        share=("weekly_value", lambda s: len(s) / len(iivm_df)),
        treatment_rate=("feature_exposure", "mean"),
        instrument_rate=("encouragement", "mean"),
        mean_true_tau=("true_tau", "mean"),
        mean_outcome=("weekly_value", "mean"),
    )
    .reset_index()
)
save_table(compliance_summary, "compliance_type_summary")
display(compliance_summary)
compliance_type n share treatment_rate instrument_rate mean_true_tau mean_outcome
0 always_taker 872 0.1744 1.000000 0.503440 1.027969 1.813908
1 complier 2312 0.4624 0.471453 0.471453 1.097708 1.028025
2 never_taker 1816 0.3632 0.000000 0.431718 0.918716 0.272747

The mean true effect differs across compliance types. This is why the local effect can differ from the full-population ATE.

Visualize Compliance And Instrument Take-Up

The left panel shows treatment rates by instrument status. The right panel shows the true individual effects by hidden compliance type.

A binary IV design needs the instrument to create a treatment-rate gap. The effect distribution reminds us that the local target can differ from the population average.

from matplotlib.lines import Line2D

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

takeup = (
    iivm_df.groupby("encouragement")
    .agg(treatment_rate=("feature_exposure", "mean"), n=("feature_exposure", "size"))
    .reset_index()
)
sns.barplot(data=takeup, x="encouragement", y="treatment_rate", color="#60a5fa", ax=axes[0])
axes[0].set_title("Treatment Take-Up By Instrument Status")
axes[0].set_xlabel("Instrument Z")
axes[0].set_ylabel("Treatment Rate")
axes[0].set_ylim(0, 1)

# Seaborn's automatic hue legend can be overwritten after adding Matplotlib reference lines.
# We draw the KDE curves without an automatic legend, then build one combined legend manually.
compliance_order = ["complier", "always_taker", "never_taker"]
compliance_order = [group for group in compliance_order if group in iivm_df["compliance_type"].unique()]
compliance_palette = dict(zip(compliance_order, sns.color_palette("Set2", n_colors=len(compliance_order))))

sns.kdeplot(
    data=iivm_df,
    x="true_tau",
    hue="compliance_type",
    hue_order=compliance_order,
    palette=compliance_palette,
    common_norm=False,
    linewidth=2,
    legend=False,
    ax=axes[1],
)
axes[1].axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5)
axes[1].axvline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5)
axes[1].set_title("Treatment Effect Heterogeneity By Compliance Type")
axes[1].set_xlabel("True Individual Effect")

legend_handles = [
    Line2D([0], [0], color=compliance_palette[group], linewidth=2, label=group.replace("_", " ").title())
    for group in compliance_order
]
legend_handles.extend(
    [
        Line2D([0], [0], color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE"),
        Line2D([0], [0], color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE"),
    ]
)
axes[1].legend(handles=legend_handles, title="Curve / reference", frameon=True, loc="best")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_compliance_and_take_up.png", dpi=160, bbox_inches="tight")
plt.show()

The instrument creates a clear treatment-rate difference. The effect distributions show why the local effect is not just a technicality; it can be a meaningfully different causal target.

Instrument Overlap

The instrument propensity should not be too close to zero or one. If some covariate profiles almost always receive the same instrument status, there is little local comparison for those profiles.

This cell summarizes the true instrument propensity and displays its distribution by observed instrument status.

instrument_overlap_summary = pd.DataFrame(
    [
        {"diagnostic": "instrument propensity minimum", "value": float(iivm_df["true_instrument_propensity"].min())},
        {"diagnostic": "instrument propensity p05", "value": float(iivm_df["true_instrument_propensity"].quantile(0.05))},
        {"diagnostic": "instrument propensity median", "value": float(iivm_df["true_instrument_propensity"].median())},
        {"diagnostic": "instrument propensity p95", "value": float(iivm_df["true_instrument_propensity"].quantile(0.95))},
        {"diagnostic": "instrument propensity maximum", "value": float(iivm_df["true_instrument_propensity"].max())},
    ]
)
save_table(instrument_overlap_summary, "instrument_overlap_summary")
display(instrument_overlap_summary)

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
    data=iivm_df,
    x="true_instrument_propensity",
    hue="encouragement",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.35,
    ax=ax,
)
ax.set_title("True Instrument Propensity By Instrument Status")
ax.set_xlabel("True P(Z=1 | X)")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_instrument_overlap.png", dpi=160, bbox_inches="tight")
plt.show()
diagnostic value
0 instrument propensity minimum 0.080000
1 instrument propensity p05 0.218420
2 instrument propensity median 0.475840
3 instrument propensity p95 0.741112
4 instrument propensity maximum 0.889678

The instrument overlap is healthy in this simulation. In real data, poor instrument overlap can make the local effect unstable or limit the population for which the estimate is meaningful.
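
When overlap is thin, the propensity trimming built into DoubleMLIIVM is the first lever to reach for. The fragment below is a sketch only: dml_data and the boosted learners are constructed later in this notebook, and the threshold value is illustrative.

# Sketch: trimming_rule="truncate" clips the estimated instrument propensity
# m_hat into [threshold, 1 - threshold] before it enters the score.
iivm_trimmed = DoubleMLIIVM(
    dml_data,  # constructed later in this notebook
    ml_g=clone(hgb_outcome_learner),
    ml_m=clone(hgb_instrument_learner),
    ml_r=clone(hgb_treatment_learner),
    n_folds=5,
    score="LATE",
    trimming_rule="truncate",
    trimming_threshold=0.05,  # illustrative; tune to the overlap at hand
)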

Correlation Screen

The correlation matrix gives a quick view of relationships among outcome, treatment, instrument, compliance probabilities, and controls.

Correlation is not identification. It is just a useful screen before fitting models.

corr_cols = [
    "weekly_value",
    "feature_exposure",
    "encouragement",
    "true_instrument_propensity",
    "true_p_complier",
    "true_tau",
    "latent_compliance_factor",
] + feature_cols
corr_matrix = iivm_df[corr_cols].corr()
save_table(corr_matrix.reset_index().rename(columns={"index": "column"}), "design_correlation_matrix")

fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(
    corr_matrix,
    cmap="vlag",
    center=0,
    annot=False,
    linewidths=0.4,
    cbar_kws={"label": "Correlation"},
    ax=ax,
)
ax.set_title("Correlation Screen For IIVM Teaching Data")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_design_correlation_matrix.png", dpi=160, bbox_inches="tight")
plt.show()

The treatment is related to the latent compliance factor, which is why direct treatment comparisons are biased. The instrument is designed to be usable after adjustment for controls.

Baseline Estimators

We compare several simple estimators before fitting DoubleML:

  • Difference in means by treatment status.
  • OLS adjustment using treatment and observed controls.
  • Raw Wald ratio using instrument groups without controls.
  • Linear residualized Wald ratio.
  • Oracle IIVM ratio using the true nuisance functions.

The oracle estimator is available only in this synthetic notebook.

y = iivm_df["weekly_value"].to_numpy()
d = iivm_df["feature_exposure"].to_numpy()
z = iivm_df["encouragement"].to_numpy()
X = iivm_df[feature_cols]

baseline_rows = []
baseline_rows.append(mean_difference_summary(y, d, "Difference in means by treatment", target="direct treatment comparison"))
baseline_rows.append(treatment_ols_summary(iivm_df["weekly_value"], iivm_df[["feature_exposure"] + feature_cols], "feature_exposure", "OLS with raw controls"))
baseline_rows.append(wald_summary(y, d, z, "Raw Wald ratio"))
baseline_rows.append(residualized_wald_summary(y, d, z, X, "Linear residualized Wald ratio"))
baseline_rows.append(
    iivm_ratio_summary(
        y=y,
        d=d,
        z=z,
        g0_hat=iivm_df["true_g0"].to_numpy(),
        g1_hat=iivm_df["true_g1"].to_numpy(),
        m_hat=iivm_df["true_instrument_propensity"].to_numpy(),
        r0_hat=iivm_df["true_r0"].to_numpy(),
        r1_hat=iivm_df["true_r1"].to_numpy(),
        label="Oracle IIVM ratio",
    )
)

baseline_estimates = pd.DataFrame(baseline_rows)
baseline_estimates["true_target"] = np.where(
    baseline_estimates["target"].str.contains("treatment", case=False, na=False),
    TRUE_ATE,
    TRUE_LATE,
)
baseline_estimates.loc[baseline_estimates["estimator"].eq("OLS with raw controls"), "true_target"] = TRUE_ATE
baseline_estimates["bias_vs_target"] = baseline_estimates["theta_hat"] - baseline_estimates["true_target"]
save_table(baseline_estimates, "baseline_estimates")
display(baseline_estimates)
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value wald_numerator wald_denominator iivm_numerator iivm_denominator true_target bias_vs_target
0 Difference in means by treatment direct treatment comparison 1.523229 0.040880 1.443103 1.603354 NaN NaN NaN NaN NaN 1.020535 0.502693
1 OLS with raw controls observed-control treatment slope 1.223530 0.037648 1.149742 1.297318 1.082914e-231 NaN NaN NaN NaN 1.020535 0.202994
2 Raw Wald ratio raw Wald LATE-style ratio 1.849634 0.080037 1.692762 2.006506 NaN 0.924632 0.499900 NaN NaN 1.101928 0.747706
3 Linear residualized Wald ratio linear residualized Wald ratio 1.178991 0.078897 1.024353 1.333629 NaN 0.124390 0.105505 NaN NaN 1.101928 0.077064
4 Oracle IIVM ratio LATE 1.116161 0.064275 0.990182 1.242139 NaN NaN NaN 0.527789 0.472861 1.101928 0.014233

Direct treatment comparisons are biased because treatment selection is related to latent compliance factors. The raw Wald ratio is also off because the instrument depends on observed controls. Adjustment matters for the binary IV design.

Baseline Estimate Plot

The plot compares baseline estimates against two reference lines: the true ATE and the true local effect.

The local effect is the correct reference for IV-style rows. The ATE is the reference for direct treatment-comparison rows.

plot_baselines = baseline_estimates.copy().sort_values("theta_hat")
plot_baselines["lower_error"] = plot_baselines["theta_hat"] - plot_baselines["ci_95_lower"]
plot_baselines["upper_error"] = plot_baselines["ci_95_upper"] - plot_baselines["theta_hat"]

fig, ax = plt.subplots(figsize=(10.8, 5.8))
y_pos = np.arange(len(plot_baselines))
ax.errorbar(
    x=plot_baselines["theta_hat"],
    y=y_pos,
    xerr=[plot_baselines["lower_error"], plot_baselines["upper_error"]],
    fmt="o",
    color="#0f172a",
    ecolor="#475569",
    capsize=4,
)
ax.axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE")
ax.axvline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE")
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_baselines["estimator"])
ax.set_xlabel("Estimated Effect")
ax.set_title("Baseline Binary IV Estimates")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_baseline_estimates.png", dpi=160, bbox_inches="tight")
plt.show()

The figure shows why IIVM has its own notebook. Binary IV estimates answer a local instrument-induced question, not the same question as a binary-treatment IRM ATE.

Nuisance Learners

DoubleMLIIVM uses three nuisance learner roles:

  • ml_g: outcome regression as a function of instrument status and controls.
  • ml_m: instrument propensity classifier, P(Z=1 | X).
  • ml_r: treatment take-up classifier, P(D=1 | Z, X).

We compare a linear/logistic baseline with gradient boosting. The outcome nuisance uses a regressor because Y is continuous, while the instrument and treatment nuisances use classifiers.

linear_outcome_learner = make_pipeline(
    StandardScaler(),
    LassoCV(cv=3, random_state=RANDOM_STATE, max_iter=5_000),
)
linear_instrument_learner = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=3, max_iter=2_000, scoring="neg_log_loss", random_state=RANDOM_STATE),
)
linear_treatment_learner = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=3, max_iter=2_000, scoring="neg_log_loss", random_state=RANDOM_STATE),
)

hgb_outcome_learner = HistGradientBoostingRegressor(
    max_iter=180,
    learning_rate=0.05,
    max_leaf_nodes=24,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
hgb_instrument_learner = HistGradientBoostingClassifier(
    max_iter=180,
    learning_rate=0.05,
    max_leaf_nodes=24,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
hgb_treatment_learner = HistGradientBoostingClassifier(
    max_iter=180,
    learning_rate=0.05,
    max_leaf_nodes=24,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)

learner_catalog = pd.DataFrame(
    [
        {"learner_name": "Linear/logistic nuisance set", "use_case": "Transparent baseline; may underfit nonlinear compliance and outcome patterns."},
        {"learner_name": "Gradient-boosted nuisance set", "use_case": "Captures nonlinear instrument, treatment, and outcome nuisance functions."},
    ]
)
save_table(learner_catalog, "learner_catalog")
display(learner_catalog)
learner_name use_case
0 Linear/logistic nuisance set Transparent baseline; may underfit nonlinear c...
1 Gradient-boosted nuisance set Captures nonlinear instrument, treatment, and ...

The nuisance learners estimate the components of the ratio score. The final target remains the local treatment effect, not the predictive performance itself.

Manual Cross-Fitted IIVM

Before fitting DoubleML, we manually compute the cross-fitted IIVM ratio.

The numerator estimates how the instrument changes the outcome after adjusting for controls. The denominator estimates how the instrument changes treatment after adjusting for controls. Their ratio estimates the local treatment effect.

manual_g0_hat, manual_g1_hat, manual_m_hat, manual_r0_hat, manual_r1_hat = crossfit_iivm_nuisances(
    X=X,
    y=y,
    d=d,
    z=z,
    outcome_learner=hgb_outcome_learner,
    instrument_learner=hgb_instrument_learner,
    treatment_learner=hgb_treatment_learner,
    n_splits=5,
    random_state=RANDOM_STATE,
)

manual_iivm_estimate = pd.DataFrame(
    [
        iivm_ratio_summary(
            y=y,
            d=d,
            z=z,
            g0_hat=manual_g0_hat,
            g1_hat=manual_g1_hat,
            m_hat=manual_m_hat,
            r0_hat=manual_r0_hat,
            r1_hat=manual_r1_hat,
            label="Manual cross-fitted IIVM",
        )
    ]
)
manual_iivm_estimate["true_target"] = TRUE_LATE
manual_iivm_estimate["bias_vs_target"] = manual_iivm_estimate["theta_hat"] - TRUE_LATE
save_table(manual_iivm_estimate, "manual_crossfit_iivm_estimate")
display(manual_iivm_estimate)
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value iivm_numerator iivm_denominator true_target bias_vs_target
0 Manual cross-fitted IIVM LATE 1.185114 0.081281 1.025803 1.344424 NaN 0.549075 0.46331 1.101928 0.083186

The manual cross-fitted ratio is the IIVM idea in plain Python. DoubleML automates the same structure with score management, inference, repeated splits, and diagnostics.

Manual Nuisance Quality

Since this is synthetic data, we can compare cross-fitted nuisance predictions to the true nuisance functions.

Real applications cannot do this. They should use out-of-fold loss, instrument overlap, first-stage diagnostics, and design evidence.

manual_nuisance_quality = pd.DataFrame(
    [
        {
            "nuisance": "g0(X) = E[Y | Z=0, X]",
            "rmse_vs_truth": rmse(iivm_df["true_g0"], manual_g0_hat),
            "mae_vs_truth": float(mean_absolute_error(iivm_df["true_g0"], manual_g0_hat)),
            "correlation_vs_truth": float(np.corrcoef(iivm_df["true_g0"], manual_g0_hat)[0, 1]),
        },
        {
            "nuisance": "g1(X) = E[Y | Z=1, X]",
            "rmse_vs_truth": rmse(iivm_df["true_g1"], manual_g1_hat),
            "mae_vs_truth": float(mean_absolute_error(iivm_df["true_g1"], manual_g1_hat)),
            "correlation_vs_truth": float(np.corrcoef(iivm_df["true_g1"], manual_g1_hat)[0, 1]),
        },
        {
            "nuisance": "m(X) = P(Z=1 | X)",
            "rmse_vs_truth": rmse(iivm_df["true_instrument_propensity"], manual_m_hat),
            "mae_vs_truth": float(mean_absolute_error(iivm_df["true_instrument_propensity"], manual_m_hat)),
            "correlation_vs_truth": float(np.corrcoef(iivm_df["true_instrument_propensity"], manual_m_hat)[0, 1]),
            "auc_for_instrument": float(roc_auc_score(z, manual_m_hat)),
            "log_loss_for_instrument": float(log_loss(z, manual_m_hat)),
        },
        {
            "nuisance": "r0(X) = P(D=1 | Z=0, X)",
            "rmse_vs_truth": rmse(iivm_df["true_r0"], manual_r0_hat),
            "mae_vs_truth": float(mean_absolute_error(iivm_df["true_r0"], manual_r0_hat)),
            "correlation_vs_truth": float(np.corrcoef(iivm_df["true_r0"], manual_r0_hat)[0, 1]),
        },
        {
            "nuisance": "r1(X) = P(D=1 | Z=1, X)",
            "rmse_vs_truth": rmse(iivm_df["true_r1"], manual_r1_hat),
            "mae_vs_truth": float(mean_absolute_error(iivm_df["true_r1"], manual_r1_hat)),
            "correlation_vs_truth": float(np.corrcoef(iivm_df["true_r1"], manual_r1_hat)[0, 1]),
        },
    ]
)
save_table(manual_nuisance_quality, "manual_nuisance_quality")
display(manual_nuisance_quality)
nuisance rmse_vs_truth mae_vs_truth correlation_vs_truth auc_for_instrument log_loss_for_instrument
0 g0(X) = E[Y | Z=0, X] 0.432298 0.332140 0.893133 NaN NaN
1 g1(X) = E[Y | Z=1, X] 0.472638 0.360051 0.906499 NaN NaN
2 m(X) = P(Z=1 | X) 0.115957 0.092211 0.807057 0.649979 0.668999
3 r0(X) = P(D=1 | Z=0, X) 0.117598 0.094287 0.394385 NaN NaN
4 r1(X) = P(D=1 | Z=1, X) 0.155754 0.126420 0.674081 NaN NaN

The nuisance predictions are useful but imperfect. The local ratio estimate is designed to be less sensitive to small nuisance errors, but weak nuisance quality can still create instability.
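
One way to see that robustness is a quick perturbation sketch, reusing the manual predictions from above: shift both outcome nuisances by a constant and re-evaluate the ratio. The propensity-weighted correction terms in the score largely absorb the shift.

# Perturb the outcome nuisances and re-evaluate the manual IIVM ratio.
shift = 0.25
perturbed = iivm_ratio_summary(
    y=y,
    d=d,
    z=z,
    g0_hat=manual_g0_hat + shift,
    g1_hat=manual_g1_hat + shift,
    m_hat=manual_m_hat,
    r0_hat=manual_r0_hat,
    r1_hat=manual_r1_hat,
    label="Perturbed outcome nuisances",
)
theta_move = perturbed["theta_hat"] - manual_iivm_estimate["theta_hat"].iloc[0]
print(f"theta moved by {theta_move:+.4f} after a {shift:.2f} nuisance shift")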

Manual First-Stage Diagnostics

The denominator of the IIVM ratio is the estimated instrument effect on treatment. This is the first stage.

If the denominator is close to zero, the local effect becomes unstable. This cell summarizes the manual cross-fitted first stage.

manual_first_stage = manual_r1_hat - manual_r0_hat
manual_first_stage_summary = pd.DataFrame(
    [
        {
            "diagnostic": "mean first stage r1_hat - r0_hat",
            "value": float(np.mean(manual_first_stage)),
        },
        {"diagnostic": "first-stage p05", "value": float(np.quantile(manual_first_stage, 0.05))},
        {"diagnostic": "first-stage median", "value": float(np.median(manual_first_stage))},
        {"diagnostic": "first-stage p95", "value": float(np.quantile(manual_first_stage, 0.95))},
        {"diagnostic": "share with negative estimated first stage", "value": float(np.mean(manual_first_stage < 0))},
    ]
)
save_table(manual_first_stage_summary, "manual_first_stage_summary")
display(manual_first_stage_summary)

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(manual_first_stage, bins=50, color="#047857", alpha=0.75, ax=ax)
ax.axvline(0, color="#111827", linestyle="--", linewidth=1)
ax.axvline(TRUE_FIRST_STAGE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True average first stage")
ax.set_title("Cross-Fitted First-Stage Distribution")
ax.set_xlabel("r1_hat(X) - r0_hat(X)")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_manual_first_stage_distribution.png", dpi=160, bbox_inches="tight")
plt.show()
diagnostic value
0 mean first stage r1_hat - r0_hat 0.505220
1 first-stage p05 0.089992
2 first-stage median 0.528812
3 first-stage p95 0.846567
4 share with negative estimated first stage 0.023400

The first stage is mostly positive and comfortably away from zero on average. That fits the monotonic noncompliance design used in the simulation.

Build The DoubleML Data Object

DoubleMLIIVM needs a DoubleMLData object with:

  • outcome column weekly_value;
  • treatment column feature_exposure;
  • instrument column encouragement;
  • observed controls.

The hidden compliance and potential-outcome columns are excluded.

dml_data = DoubleMLData(
    iivm_df[["weekly_value", "feature_exposure", "encouragement"] + feature_cols],
    y_col="weekly_value",
    d_cols="feature_exposure",
    x_cols=feature_cols,
    z_cols="encouragement",
)

backend_summary = pd.DataFrame(
    [
        {"role": "outcome", "columns": dml_data.y_col},
        {"role": "treatment", "columns": ", ".join(dml_data.d_cols)},
        {"role": "instrument", "columns": ", ".join(dml_data.z_cols)},
        {"role": "controls", "columns": ", ".join(dml_data.x_cols)},
        {"role": "number_of_observations", "columns": str(dml_data.n_obs)},
    ]
)
save_table(backend_summary, "doublemldata_backend_summary")
display(backend_summary)
role columns
0 outcome weekly_value
1 treatment feature_exposure
2 instrument encouragement
3 controls engagement_score, need_intensity, content_fit,...
4 number_of_observations 5000

The instrument role is explicit through z_cols. This is the key data-backend difference between IRM and IIVM.
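
A light sanity check on the backend (a sketch): confirm the instrument sits in z_cols rather than among the controls, and let the object print its own role summary.

# The instrument must live in z_cols and must not leak into the controls.
assert "encouragement" in dml_data.z_cols
assert "encouragement" not in dml_data.x_cols
print(dml_data)  # DoubleMLData prints its outcome/treatment/instrument/control roles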

Fit DoubleMLIIVM

We now fit two IIVM specifications:

  • a linear/logistic nuisance baseline;
  • a gradient-boosted nuisance model.

Both use the default score="LATE" and five-fold cross-fitting.

iivm_model_specs = {
    "Linear nuisance IIVM": {
        "ml_g": linear_outcome_learner,
        "ml_m": linear_instrument_learner,
        "ml_r": linear_treatment_learner,
    },
    "Gradient boosting nuisance IIVM": {
        "ml_g": hgb_outcome_learner,
        "ml_m": hgb_instrument_learner,
        "ml_r": hgb_treatment_learner,
    },
}

iivm_models = {}
for model_name, spec in iivm_model_specs.items():
    model = DoubleMLIIVM(
        dml_data,
        ml_g=clone(spec["ml_g"]),
        ml_m=clone(spec["ml_m"]),
        ml_r=clone(spec["ml_r"]),
        n_folds=5,
        n_rep=1,
        score="LATE",
    )
    model.fit(store_predictions=True)
    iivm_models[model_name] = model
    print(f"Finished: {model_name}")

iivm_summary_rows = []
for model_name, model in iivm_models.items():
    summary = model.summary.reset_index().rename(columns={"index": "treatment"})
    row = summary.iloc[0].to_dict()
    iivm_summary_rows.append(
        {
            "estimator": model_name,
            "treatment": row["treatment"],
            "theta_hat": row["coef"],
            "std_error": row["std err"],
            "t_stat": row["t"],
            "p_value": row["P>|t|"],
            "ci_95_lower": row["2.5 %"],
            "ci_95_upper": row["97.5 %"],
            "true_target": TRUE_LATE,
            "bias_vs_target": row["coef"] - TRUE_LATE,
        }
    )

iivm_summary = pd.DataFrame(iivm_summary_rows)
save_table(iivm_summary, "doubleml_iivm_summary")
display(iivm_summary)
Finished: Linear nuisance IIVM
Finished: Gradient boosting nuisance IIVM
estimator treatment theta_hat std_error t_stat p_value ci_95_lower ci_95_upper true_target bias_vs_target
0 Linear nuisance IIVM feature_exposure 1.166017 0.080283 14.523824 8.559872e-48 1.008666 1.323369 1.101928 0.064090
1 Gradient boosting nuisance IIVM feature_exposure 1.206022 0.081519 14.794455 1.590685e-49 1.046249 1.365795 1.101928 0.104094

The gradient-boosted learners are better matched to the nonlinear synthetic compliance and outcome functions, although in this particular draw the linear/logistic specification happens to land slightly closer to the true local effect; single-split comparisons like this are noisy. The linear/logistic model remains useful as a transparent baseline.
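
Beyond the summary table, the fitted objects support further inference utilities. A sketch using DoubleML's multiplier bootstrap (the n_rep_boot value is illustrative):

# Multiplier bootstrap, then (optionally joint) confidence intervals.
boot_model = iivm_models["Gradient boosting nuisance IIVM"]
boot_model.bootstrap(method="normal", n_rep_boot=500)
display(boot_model.confint(joint=True))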

Compare All Estimators

This table combines direct treatment comparisons, Wald-style baselines, the manual cross-fitted IIVM estimate, and DoubleML estimates.

The key comparison is whether an estimator targets the full-population treatment effect or the local instrument-induced effect.

all_estimates = pd.concat(
    [
        baseline_estimates[["estimator", "target", "theta_hat", "std_error", "ci_95_lower", "ci_95_upper", "p_value", "true_target", "bias_vs_target"]],
        manual_iivm_estimate[["estimator", "target", "theta_hat", "std_error", "ci_95_lower", "ci_95_upper", "p_value", "true_target", "bias_vs_target"]],
        iivm_summary.assign(target="LATE")[["estimator", "target", "theta_hat", "std_error", "ci_95_lower", "ci_95_upper", "p_value", "true_target", "bias_vs_target"]],
    ],
    ignore_index=True,
)
all_estimates["abs_bias_vs_target"] = all_estimates["bias_vs_target"].abs()
save_table(all_estimates, "all_estimator_comparison")
display(all_estimates.sort_values("abs_bias_vs_target"))
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value true_target bias_vs_target abs_bias_vs_target
4 Oracle IIVM ratio LATE 1.116161 0.064275 0.990182 1.242139 NaN 1.101928 0.014233 0.014233
6 Linear nuisance IIVM LATE 1.166017 0.080283 1.008666 1.323369 8.559872e-48 1.101928 0.064090 0.064090
3 Linear residualized Wald ratio linear residualized Wald ratio 1.178991 0.078897 1.024353 1.333629 NaN 1.101928 0.077064 0.077064
5 Manual cross-fitted IIVM LATE 1.185114 0.081281 1.025803 1.344424 NaN 1.101928 0.083186 0.083186
7 Gradient boosting nuisance IIVM LATE 1.206022 0.081519 1.046249 1.365795 1.590685e-49 1.101928 0.104094 0.104094
1 OLS with raw controls observed-control treatment slope 1.223530 0.037648 1.149742 1.297318 1.082914e-231 1.020535 0.202994 0.202994
0 Difference in means by treatment direct treatment comparison 1.523229 0.040880 1.443103 1.603354 NaN 1.020535 0.502693 0.502693
2 Raw Wald ratio raw Wald LATE-style ratio 1.849634 0.080037 1.692762 2.006506 NaN 1.101928 0.747706 0.747706

The adjusted local-effect estimators are much closer to the true local target than direct treatment comparisons. The table also shows why the ATE and local target should not be mixed casually.

Estimate Comparison Plot

The figure compares estimates with confidence intervals. The blue line marks the true ATE. The red line marks the true local effect.

IIVM rows should be judged against the red line, not the blue line.

plot_estimates = all_estimates.copy().sort_values("theta_hat")
plot_estimates["lower_error"] = plot_estimates["theta_hat"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["theta_hat"]

fig, ax = plt.subplots(figsize=(11, 6.8))
y_pos = np.arange(len(plot_estimates))
ax.errorbar(
    x=plot_estimates["theta_hat"],
    y=y_pos,
    xerr=[plot_estimates["lower_error"], plot_estimates["upper_error"]],
    fmt="o",
    color="#0f172a",
    ecolor="#475569",
    capsize=4,
)
ax.axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE")
ax.axvline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE")
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_estimates["estimator"])
ax.set_xlabel("Estimated Effect")
ax.set_title("IIVM Estimate Comparison")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_iivm_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The plot reinforces the main lesson: binary IV is about the treatment effect for the instrument-induced margin, not necessarily the whole population.

Nuisance Losses

evaluate_learners() reports out-of-fold nuisance losses. For IIVM, the prediction keys are:

  • ml_g0: outcome nuisance for Z=0.
  • ml_g1: outcome nuisance for Z=1.
  • ml_m: instrument propensity nuisance.
  • ml_r0: treatment take-up nuisance for Z=0.
  • ml_r1: treatment take-up nuisance for Z=1.

These are diagnostics for the nuisance stage, not proofs of instrument validity.
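The helper learner_loss_table used below was defined earlier in the notebook. As a reference, here is a hedged sketch of the idea: DoubleML's evaluate_learners() returns a dict mapping each nuisance key to an array of out-of-fold losses (RMSE by default, one entry per repetition), and the table just summarizes those arrays. The helper name and exact columns here are a reconstruction, not the author's verbatim code.

def learner_loss_table_sketch(model, model_name):
    losses = model.evaluate_learners()  # e.g. {"ml_g0": array([[...]]), ...}
    rows = []
    for learner, values in losses.items():
        flat = np.asarray(values).ravel()  # one loss per repetition
        rows.append(
            {
                "model": model_name,
                "learner": learner,
                "mean_loss": float(flat.mean()),
                "min_loss": float(flat.min()),
                "max_loss": float(flat.max()),
            }
        )
    return pd.DataFrame(rows)

With n_rep=1 the mean, min, and max coincide, which is why the three loss columns match in the table below.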

loss_tables = []
for model_name, model in iivm_models.items():
    loss_tables.append(learner_loss_table(model, model_name))

nuisance_losses = pd.concat(loss_tables, ignore_index=True)
save_table(nuisance_losses, "doubleml_nuisance_losses")
display(nuisance_losses)
model learner mean_loss min_loss max_loss
0 Linear nuisance IIVM ml_g0 1.319685 1.319685 1.319685
1 Linear nuisance IIVM ml_g1 1.356408 1.356408 1.356408
2 Linear nuisance IIVM ml_m 0.473152 0.473152 0.473152
3 Linear nuisance IIVM ml_r0 0.365250 0.365250 0.365250
4 Linear nuisance IIVM ml_r1 0.455724 0.455724 0.455724
5 Gradient boosting nuisance IIVM ml_g0 1.196010 1.196010 1.196010
6 Gradient boosting nuisance IIVM ml_g1 1.212799 1.212799 1.212799
7 Gradient boosting nuisance IIVM ml_m 0.483769 0.483769 0.483769
8 Gradient boosting nuisance IIVM ml_r0 0.379648 0.379648 0.379648
9 Gradient boosting nuisance IIVM ml_r1 0.476144 0.476144 0.476144

The loss table helps compare nuisance specifications. It does not answer exclusion, independence, or monotonicity questions.

Extract DoubleML Predictions

We now extract stored predictions from the gradient-boosted IIVM model.

These predictions let us inspect instrument propensity, treatment take-up, first-stage strength, and outcome nuisance quality.
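prediction_vector is a small helper defined earlier. A hedged sketch of what it does, assuming the model was fit with store_predictions=True and that DoubleML stores cross-fitted predictions in the .predictions dict as arrays of shape (n_obs, n_rep, n_treat):

def prediction_vector_sketch(model, key):
    # Cross-fitted out-of-fold predictions for one nuisance:
    # first repetition, first treatment column.
    return np.asarray(model.predictions[key])[:, 0, 0]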

iivm_hgb = iivm_models["Gradient boosting nuisance IIVM"]
dml_g0_hat = prediction_vector(iivm_hgb, "ml_g0")
dml_g1_hat = prediction_vector(iivm_hgb, "ml_g1")
dml_m_hat = np.clip(prediction_vector(iivm_hgb, "ml_m"), 0.02, 0.98)
dml_r0_hat = np.clip(prediction_vector(iivm_hgb, "ml_r0"), 0.01, 0.99)
dml_r1_hat = np.clip(prediction_vector(iivm_hgb, "ml_r1"), 0.01, 0.99)

dml_prediction_quality = pd.DataFrame(
    [
        {"nuisance": "ml_g0 outcome under Z=0", "rmse_vs_truth": rmse(iivm_df["true_g0"], dml_g0_hat), "mae_vs_truth": float(mean_absolute_error(iivm_df["true_g0"], dml_g0_hat)), "correlation_vs_truth": float(np.corrcoef(iivm_df["true_g0"], dml_g0_hat)[0, 1])},
        {"nuisance": "ml_g1 outcome under Z=1", "rmse_vs_truth": rmse(iivm_df["true_g1"], dml_g1_hat), "mae_vs_truth": float(mean_absolute_error(iivm_df["true_g1"], dml_g1_hat)), "correlation_vs_truth": float(np.corrcoef(iivm_df["true_g1"], dml_g1_hat)[0, 1])},
        {"nuisance": "ml_m instrument propensity", "rmse_vs_truth": rmse(iivm_df["true_instrument_propensity"], dml_m_hat), "mae_vs_truth": float(mean_absolute_error(iivm_df["true_instrument_propensity"], dml_m_hat)), "correlation_vs_truth": float(np.corrcoef(iivm_df["true_instrument_propensity"], dml_m_hat)[0, 1]), "auc_for_instrument": float(roc_auc_score(z, dml_m_hat)), "log_loss_for_instrument": float(log_loss(z, dml_m_hat))},
        {"nuisance": "ml_r0 treatment take-up under Z=0", "rmse_vs_truth": rmse(iivm_df["true_r0"], dml_r0_hat), "mae_vs_truth": float(mean_absolute_error(iivm_df["true_r0"], dml_r0_hat)), "correlation_vs_truth": float(np.corrcoef(iivm_df["true_r0"], dml_r0_hat)[0, 1])},
        {"nuisance": "ml_r1 treatment take-up under Z=1", "rmse_vs_truth": rmse(iivm_df["true_r1"], dml_r1_hat), "mae_vs_truth": float(mean_absolute_error(iivm_df["true_r1"], dml_r1_hat)), "correlation_vs_truth": float(np.corrcoef(iivm_df["true_r1"], dml_r1_hat)[0, 1])},
    ]
)
save_table(dml_prediction_quality, "hgb_nuisance_quality_against_truth")
display(dml_prediction_quality)
nuisance rmse_vs_truth mae_vs_truth correlation_vs_truth auc_for_instrument log_loss_for_instrument
0 ml_g0 outcome under Z=0 0.446680 0.342004 0.886067 NaN NaN
1 ml_g1 outcome under Z=1 0.460292 0.355062 0.912415 NaN NaN
2 ml_m instrument propensity 0.113231 0.090220 0.816435 0.656847 0.664084
3 ml_r0 treatment take-up under Z=0 0.117381 0.094505 0.385844 NaN NaN
4 ml_r1 treatment take-up under Z=1 0.154934 0.125704 0.672251 NaN NaN

The nuisance predictions are good enough for the teaching design, but not perfect. The first-stage nuisance functions are especially important because they form the denominator of the local-effect ratio.
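Since the first stage sits in the denominator, it is worth writing the ratio out once. A hedged sketch recomputing the plug-in LATE from the extracted nuisance vectors; it mirrors the standard IIVM orthogonal score, so it should land near the reported coefficient, with small differences from clipping and fold-level aggregation.

# Plug-in LATE ratio from the cross-fitted nuisances extracted above.
y_arr = iivm_df["weekly_value"].to_numpy()
score_num = (
    dml_g1_hat - dml_g0_hat
    + z * (y_arr - dml_g1_hat) / dml_m_hat
    - (1 - z) * (y_arr - dml_g0_hat) / (1 - dml_m_hat)
)
score_den = (
    dml_r1_hat - dml_r0_hat
    + z * (d - dml_r1_hat) / dml_m_hat
    - (1 - z) * (d - dml_r0_hat) / (1 - dml_m_hat)
)
theta_plugin = float(np.mean(score_num) / np.mean(score_den))
print(f"plug-in LATE from extracted nuisances: {theta_plugin:.4f}")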

DoubleML First-Stage Diagnostics

This cell summarizes the DoubleML-estimated first stage r1_hat(X) - r0_hat(X).

A strong positive first stage supports relevance and monotonicity-style behavior in the fitted nuisance functions.

dml_first_stage = dml_r1_hat - dml_r0_hat

dml_first_stage_summary = pd.DataFrame(
    [
        {"diagnostic": "mean first stage r1_hat - r0_hat", "value": float(np.mean(dml_first_stage))},
        {"diagnostic": "first-stage p05", "value": float(np.quantile(dml_first_stage, 0.05))},
        {"diagnostic": "first-stage median", "value": float(np.median(dml_first_stage))},
        {"diagnostic": "first-stage p95", "value": float(np.quantile(dml_first_stage, 0.95))},
        {"diagnostic": "share with negative estimated first stage", "value": float(np.mean(dml_first_stage < 0))},
        {"diagnostic": "true average first stage", "value": TRUE_FIRST_STAGE},
    ]
)
save_table(dml_first_stage_summary, "doubleml_first_stage_summary")
display(dml_first_stage_summary)
diagnostic value
0 mean first stage r1_hat - r0_hat 0.509133
1 first-stage p05 0.085867
2 first-stage median 0.530857
3 first-stage p95 0.852517
4 share with negative estimated first stage 0.024600
5 true average first stage 0.447774

The estimated first stage is positive on average. A large share of negative estimated first stages would be a warning sign for monotonicity, learner instability, or weak local treatment movement.
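A quick follow-up diagnostic, sketched with an illustrative cutoff rather than a package default: flag the share of units whose estimated first stage is close to zero, since those regions contribute near-zero denominators to the score.

weak_threshold = 0.05  # illustrative cutoff, not a package default
weak_share = float(np.mean(dml_first_stage < weak_threshold))
print(f"share with estimated first stage below {weak_threshold}: {weak_share:.3f}")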

First-Stage And Propensity Plots

The left panel shows the estimated first-stage distribution. The right panel shows the estimated instrument propensity by observed instrument status.

Together these plots check relevance and instrument overlap.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.histplot(dml_first_stage, bins=55, color="#047857", alpha=0.75, ax=axes[0])
axes[0].axvline(0, color="#111827", linestyle="--", linewidth=1)
axes[0].axvline(TRUE_FIRST_STAGE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True average first stage")
axes[0].set_title("DoubleML Estimated First Stage")
axes[0].set_xlabel("r1_hat(X) - r0_hat(X)")
axes[0].legend()

propensity_plot_df = pd.DataFrame({"instrument_propensity": dml_m_hat, "encouragement": z})
sns.histplot(
    data=propensity_plot_df,
    x="instrument_propensity",
    hue="encouragement",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.35,
    ax=axes[1],
)
axes[1].set_title("Estimated Instrument Propensity Overlap")
axes[1].set_xlabel("m_hat(X) = P(Z=1 | X)")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_first_stage_and_instrument_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The first-stage distribution is mostly positive and the instrument propensity has adequate overlap. These are necessary diagnostics, not sufficient proof of IV validity.

Visual Nuisance Diagnostics

The next figure compares DoubleML’s gradient-boosted nuisance predictions with true synthetic nuisance functions.

In real data, this truth comparison is not possible. The analogous workflow is to inspect out-of-fold losses, calibration, overlap, and first-stage stability.
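For example, a calibration check on the instrument propensity needs no ground truth. A minimal sketch using scikit-learn's calibration_curve (the bin count and quantile strategy are illustrative choices):

from sklearn.calibration import calibration_curve

# Compare mean predicted instrument propensity with the observed instrument
# rate inside quantile bins; calibrated predictions track the observed rates.
obs_rate, mean_pred = calibration_curve(z, dml_m_hat, n_bins=10, strategy="quantile")
display(pd.DataFrame({"mean_predicted_m_hat": mean_pred, "observed_z_rate": obs_rate}))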

# Scatter a random subsample; positional indexing into the numpy prediction
# arrays assumes iivm_df keeps its default RangeIndex.
sample_idx = iivm_df.sample(1_000, random_state=RANDOM_STATE).index

fig, axes = plt.subplots(2, 3, figsize=(16, 9))
plot_specs = [
    (iivm_df["true_g0"].to_numpy(), dml_g0_hat, "Outcome Nuisance Z=0", "True g0(X)", "Predicted g0_hat(X)", "#2563eb"),
    (iivm_df["true_g1"].to_numpy(), dml_g1_hat, "Outcome Nuisance Z=1", "True g1(X)", "Predicted g1_hat(X)", "#047857"),
    (iivm_df["true_instrument_propensity"].to_numpy(), dml_m_hat, "Instrument Propensity", "True m(X)", "Predicted m_hat(X)", "#b45309"),
    (iivm_df["true_r0"].to_numpy(), dml_r0_hat, "Treatment Take-Up Z=0", "True r0(X)", "Predicted r0_hat(X)", "#7c3aed"),
    (iivm_df["true_r1"].to_numpy(), dml_r1_hat, "Treatment Take-Up Z=1", "True r1(X)", "Predicted r1_hat(X)", "#0891b2"),
]

for ax, (truth, pred, title, xlabel, ylabel, color) in zip(axes.ravel(), plot_specs):
    ax.scatter(truth[sample_idx], pred[sample_idx], alpha=0.25, s=16, color=color)
    lower = min(np.quantile(truth, 0.01), np.quantile(pred, 0.01))
    upper = max(np.quantile(truth, 0.99), np.quantile(pred, 0.99))
    ax.plot([lower, upper], [lower, upper], color="#b91c1c", linestyle="--", linewidth=1.3)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

axes.ravel()[-1].axis("off")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_doubleml_nuisance_prediction_quality.png", dpi=160, bbox_inches="tight")
plt.show()

The nuisance plots show the model captures the main patterns but smooths some extremes. That is normal for finite-sample machine-learning nuisance estimation.

Score Contributions

DoubleML stores orthogonal score contributions in psi. Large tails can signal influential observations, unstable instrument propensities, or weak first-stage regions.

This cell summarizes and plots score contributions from the main gradient-boosted IIVM fit.

psi_values = np.asarray(iivm_hgb.psi)[:, 0, 0]  # psi has shape (n_obs, n_rep, n_treat); take first rep, first treatment
psi_summary = pd.DataFrame(
    [
        {
            "mean": float(np.mean(psi_values)),
            "std": float(np.std(psi_values, ddof=1)),
            "p01": float(np.quantile(psi_values, 0.01)),
            "p05": float(np.quantile(psi_values, 0.05)),
            "median": float(np.median(psi_values)),
            "p95": float(np.quantile(psi_values, 0.95)),
            "p99": float(np.quantile(psi_values, 0.99)),
        }
    ]
)
save_table(psi_summary, "orthogonal_score_summary")
display(psi_summary)

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(psi_values, bins=80, color="#2563eb", alpha=0.75, ax=ax)
ax.axvline(0, color="#111827", linewidth=1, linestyle="--")
ax.set_title("IIVM Orthogonal Score Contributions")
ax.set_xlabel("psi_i")
ax.set_ylabel("Count")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_orthogonal_score_distribution.png", dpi=160, bbox_inches="tight")
plt.show()
mean std p01 p05 median p95 p99
0 1.080025e-16 2.694336 -6.732241 -3.857392 -0.044926 4.082471 7.634711

The score is centered near zero at the fitted estimate. The tails are moderate for this well-behaved teaching design.
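As a cross-check, the reported standard error can be recovered from the stored score elements. This sketch assumes DoubleML's linear score decomposition psi = psi_a * theta + psi_b, with the psi_deriv attribute holding psi_a; under that assumption se = sqrt(mean(psi^2) / mean(psi_a)^2 / n).

psi_a = np.asarray(iivm_hgb.psi_deriv)[:, 0, 0]  # derivative part of the score
n_obs = psi_values.shape[0]
se_from_psi = float(np.sqrt(np.mean(psi_values**2) / np.mean(psi_a) ** 2 / n_obs))
print(f"se from psi: {se_from_psi:.6f} vs reported: {float(iivm_hgb.se[0]):.6f}")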

Bootstrap Confidence Interval

DoubleML can compute bootstrap confidence intervals. We run a moderate bootstrap for the main gradient-boosted IIVM model.

As always, bootstrap uncertainty does not validate the instrument assumptions.

iivm_hgb.bootstrap(method="normal", n_rep_boot=500)
bootstrap_ci = iivm_hgb.confint(joint=False, level=0.95).reset_index().rename(columns={"index": "treatment"})
bootstrap_ci["estimator"] = "Gradient boosting nuisance IIVM bootstrap"
save_table(bootstrap_ci, "hgb_bootstrap_confidence_interval")
display(bootstrap_ci)
treatment 2.5 % 97.5 % estimator
0 feature_exposure 1.046249 1.365795 Gradient boosting nuisance IIVM bootstrap

With joint=False and a single treatment coefficient, the interval above coincides with the standard asymptotic interval; the stored bootstrap draws are used when simultaneous (joint) intervals are requested. Either way, the interval quantifies sampling uncertainty under the fitted design and should be reported alongside first-stage and instrument-validity discussion.
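For reference, a minimal sketch of the joint-interval call, which is where the stored bootstrap draws are actually used; with a single treatment coefficient it stays close to the marginal interval.

# Simultaneous confidence interval based on the bootstrap draws above.
joint_ci = iivm_hgb.confint(joint=True, level=0.95)
display(joint_ci)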

Repeated Sample Splitting

Cross-fitted estimates can move with different fold splits. Repeated sample splitting checks numerical stability.

This cell uses a slightly lighter gradient-boosting configuration so the check remains practical.

repeated_outcome_learner = HistGradientBoostingRegressor(
    max_iter=120,
    learning_rate=0.06,
    max_leaf_nodes=22,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
repeated_instrument_learner = HistGradientBoostingClassifier(
    max_iter=120,
    learning_rate=0.06,
    max_leaf_nodes=22,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
repeated_treatment_learner = HistGradientBoostingClassifier(
    max_iter=120,
    learning_rate=0.06,
    max_leaf_nodes=22,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)

iivm_repeated = DoubleMLIIVM(
    dml_data,
    ml_g=clone(repeated_outcome_learner),
    ml_m=clone(repeated_instrument_learner),
    ml_r=clone(repeated_treatment_learner),
    n_folds=5,
    n_rep=5,
    score="LATE",
)
iivm_repeated.fit(store_predictions=False)

split_stability = pd.DataFrame(
    {
        "repetition": np.arange(1, iivm_repeated.all_coef.shape[1] + 1),
        "theta_hat": iivm_repeated.all_coef[0],
        "std_error": iivm_repeated.all_se[0],
    }
)
split_stability["true_late"] = TRUE_LATE
split_stability["bias_vs_true_late"] = split_stability["theta_hat"] - TRUE_LATE
save_table(split_stability, "repeated_sample_split_stability")
display(split_stability)
repetition theta_hat std_error true_late bias_vs_true_late
0 1 1.068802 0.075608 1.101928 -0.033125
1 2 1.151013 0.078619 1.101928 0.049085
2 3 1.141402 0.077209 1.101928 0.039474
3 4 1.127719 0.078196 1.101928 0.025792
4 5 1.161182 0.075708 1.101928 0.059254

The repeated estimates should be reasonably close. If they jump around, first-stage instability or nuisance-model sensitivity needs attention.
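With n_rep greater than one, the model's headline coefficient aggregates the per-split estimates; DoubleML documents a median-type aggregation, so a quick check (hedged on that rule) is to compare the aggregate with the median of the repetitions.

print(f"aggregated coef: {float(iivm_repeated.coef[0]):.4f}")
print(f"median of per-split coefs: {float(np.median(iivm_repeated.all_coef[0])):.4f}")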

Split Stability Plot

The next figure shows repeated-split local-effect estimates against the true synthetic local effect.

fig, ax = plt.subplots(figsize=(8.5, 4.8))
sns.lineplot(data=split_stability, x="repetition", y="theta_hat", marker="o", color="#2563eb", ax=ax)
ax.axhline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE")
ax.set_title("IIVM Repeated Sample-Splitting Stability")
ax.set_xlabel("Repetition")
ax.set_ylabel("Estimated LATE")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_repeated_split_stability.png", dpi=160, bbox_inches="tight")
plt.show()

The estimate is stable across repeated splits in this synthetic design. That supports numerical reliability, not instrument validity by itself.

Subgroup Assumption Variants

DoubleMLIIVM has a subgroups argument for designs where always-takers or never-takers are known to be absent.

Our synthetic data contains both always-takers and never-takers. The default setting is therefore the right one. The next cell fits alternative subgroup restrictions to show that incorrect compliance restrictions can change the estimate.

subgroup_specs = {
    "Default: always and never possible": None,
    "Assume no always-takers": {"always_takers": False, "never_takers": True},
    "Assume no never-takers": {"always_takers": True, "never_takers": False},
    "Assume only compliers": {"always_takers": False, "never_takers": False},
}

subgroup_rows = []
for label, subgroup_config in subgroup_specs.items():
    model = DoubleMLIIVM(
        dml_data,
        ml_g=clone(repeated_outcome_learner),
        ml_m=clone(repeated_instrument_learner),
        ml_r=clone(repeated_treatment_learner),
        n_folds=5,
        n_rep=1,
        score="LATE",
        subgroups=subgroup_config,
    )
    model.fit(store_predictions=False)
    row = model.summary.reset_index().rename(columns={"index": "treatment"}).iloc[0]
    subgroup_rows.append(
        {
            "subgroup_assumption": label,
            "theta_hat": row["coef"],
            "std_error": row["std err"],
            "ci_95_lower": row["2.5 %"],
            "ci_95_upper": row["97.5 %"],
            "true_late": TRUE_LATE,
            "bias_vs_true_late": row["coef"] - TRUE_LATE,
        }
    )

subgroup_comparison = pd.DataFrame(subgroup_rows)
save_table(subgroup_comparison, "subgroup_assumption_comparison")
display(subgroup_comparison)
subgroup_assumption theta_hat std_error ci_95_lower ci_95_upper true_late bias_vs_true_late
0 Default: always and never possible 1.145934 0.077302 0.994425 1.297442 1.101928 0.044006
1 Assume no always-takers 1.148685 0.078316 0.995190 1.302181 1.101928 0.046758
2 Assume no never-takers 1.170313 0.083855 1.005961 1.334666 1.101928 0.068385
3 Assume only compliers 1.214532 0.085916 1.046140 1.382925 1.101928 0.112605

The subgroup variants are not interchangeable: ruling out always-takers amounts to pinning r0(X) at zero, and ruling out never-takers pins r1(X) at one, so a wrong restriction distorts the denominator of the score. Use these options only when the design genuinely rules out the corresponding compliance types.

First-Stage Stress Test

A weak instrument creates a small denominator in the IIVM ratio. Even with perfect nuisance functions, weak first stages inflate uncertainty and can make estimates unstable.

This synthetic stress test changes only the complier share and uses oracle nuisance functions so the first-stage issue is isolated.

weak_rows = []
base_tau = true_tau.to_numpy()
base_mu0 = true_mu0.to_numpy()
base_m = true_instrument_propensity.to_numpy()
base_z = encouragement
base_noise = outcome_noise

for multiplier in [0.25, 0.50, 0.75, 1.00, 1.25]:
    pc_stress = np.clip(p_complier.to_numpy() * multiplier, 0.03, 0.90)
    pa_stress = p_always.to_numpy()
    pn_stress = np.clip(1 - pc_stress - pa_stress, 0.03, 0.95)
    total = pc_stress + pa_stress + pn_stress
    pc_stress = pc_stress / total
    pa_stress = pa_stress / total
    pn_stress = pn_stress / total

    draw = rng.uniform(size=N_OBS)
    stress_type = np.where(draw < pa_stress, "always_taker", np.where(draw < pa_stress + pc_stress, "complier", "never_taker"))
    stress_d = np.where(stress_type == "always_taker", 1, np.where(stress_type == "never_taker", 0, base_z))
    stress_y = base_mu0 + base_tau * stress_d + base_noise
    stress_r0 = pa_stress
    stress_r1 = pa_stress + pc_stress
    stress_g0 = base_mu0 + base_tau * stress_r0
    stress_g1 = base_mu0 + base_tau * stress_r1
    stress_late = float(np.mean(pc_stress * base_tau) / np.mean(pc_stress))

    estimate = iivm_ratio_summary(stress_y, stress_d, base_z, stress_g0, stress_g1, base_m, stress_r0, stress_r1, f"first-stage multiplier={multiplier}")
    weak_rows.append(
        {
            "first_stage_multiplier": multiplier,
            "oracle_late": stress_late,
            "mean_first_stage": float(np.mean(stress_r1 - stress_r0)),
            "theta_hat_oracle_iivm": estimate["theta_hat"],
            "std_error_oracle_iivm": estimate["std_error"],
            "ci_95_lower": estimate["ci_95_lower"],
            "ci_95_upper": estimate["ci_95_upper"],
        }
    )

weak_instrument_table = pd.DataFrame(weak_rows)
weak_instrument_table["bias_vs_oracle_late"] = weak_instrument_table["theta_hat_oracle_iivm"] - weak_instrument_table["oracle_late"]
save_table(weak_instrument_table, "weak_instrument_stress_test")
display(weak_instrument_table)
first_stage_multiplier oracle_late mean_first_stage theta_hat_oracle_iivm std_error_oracle_iivm ci_95_lower ci_95_upper bias_vs_oracle_late
0 0.25 1.101928 0.111944 1.144375 0.252240 0.649985 1.638765 0.042447
1 0.50 1.101928 0.223887 1.144239 0.124127 0.900950 1.387528 0.042311
2 0.75 1.101928 0.335831 1.132889 0.094093 0.948467 1.317311 0.030961
3 1.00 1.101928 0.447774 1.105957 0.066398 0.975816 1.236097 0.004029
4 1.25 1.098718 0.556162 1.110258 0.054945 1.002567 1.217950 0.011541

As the first stage shrinks, uncertainty grows. A binary IV design with very few compliers cannot support a precise local effect estimate.
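The inverse scaling is visible in the table itself: the product of the oracle standard error and the mean first stage is roughly constant across rows, consistent with the standard error scaling like one over the denominator.

# The product should be roughly constant (about 0.03 here) across multipliers.
scaling_check = (
    weak_instrument_table["std_error_oracle_iivm"] * weak_instrument_table["mean_first_stage"]
)
print(scaling_check.round(3).tolist())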

Weak-Instrument Plot

The left panel shows oracle local-effect estimates under different first-stage strengths. The right panel shows the denominator of the ratio.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

axes[0].errorbar(
    x=weak_instrument_table["first_stage_multiplier"],
    y=weak_instrument_table["theta_hat_oracle_iivm"],
    yerr=[
        weak_instrument_table["theta_hat_oracle_iivm"] - weak_instrument_table["ci_95_lower"],
        weak_instrument_table["ci_95_upper"] - weak_instrument_table["theta_hat_oracle_iivm"],
    ],
    fmt="o-",
    color="#2563eb",
    ecolor="#475569",
    capsize=4,
)
axes[0].set_title("Oracle IIVM Under First-Stage Stress")
axes[0].set_xlabel("Complier-Share Multiplier")
axes[0].set_ylabel("Estimated LATE")

sns.lineplot(
    data=weak_instrument_table,
    x="first_stage_multiplier",
    y="mean_first_stage",
    marker="o",
    color="#047857",
    ax=axes[1],
)
axes[1].set_title("Mean First Stage")
axes[1].set_xlabel("Complier-Share Multiplier")
axes[1].set_ylabel("E[r1(X) - r0(X)]")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_weak_instrument_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()

The first stage is the denominator of the estimate. When it gets small, a local-effect estimate can become noisy even if the instrument is valid.

Exclusion-Violation Stress Test

Now we break the exclusion restriction by adding a direct effect of the instrument to the outcome.

The calculation uses oracle nuisance functions so the design violation is isolated. If the instrument affects the outcome directly, the IV ratio no longer recovers the treatment effect for compliers.
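As a quick reasoning step: a direct effect c of the instrument on the outcome shifts the numerator of the ratio by c while leaving the denominator unchanged, so the population ratio shifts by approximately

\[ \Delta\theta \approx \frac{c}{E[r_1(X) - r_0(X)]}. \]

With a mean first stage near 0.45, each 0.10 of direct effect should add bias on the order of 0.22, which lines up with the bias column in the table below.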

exclusion_rows = []
for direct_effect in [0.00, 0.10, 0.25, 0.50]:
    invalid_y = weekly_value + direct_effect * encouragement
    invalid_g0 = true_g0.to_numpy()
    invalid_g1 = true_g1.to_numpy() + direct_effect
    estimate = iivm_ratio_summary(
        invalid_y,
        d,
        z,
        invalid_g0,
        invalid_g1,
        iivm_df["true_instrument_propensity"].to_numpy(),
        iivm_df["true_r0"].to_numpy(),
        iivm_df["true_r1"].to_numpy(),
        f"direct Z effect={direct_effect:.2f}",
    )
    exclusion_rows.append(
        {
            "direct_instrument_effect": direct_effect,
            "theta_hat_oracle_iivm": estimate["theta_hat"],
            "std_error_oracle_iivm": estimate["std_error"],
            "ci_95_lower": estimate["ci_95_lower"],
            "ci_95_upper": estimate["ci_95_upper"],
            "true_late_without_direct_effect": TRUE_LATE,
            "bias_vs_true_late": estimate["theta_hat"] - TRUE_LATE,
        }
    )

exclusion_violation_table = pd.DataFrame(exclusion_rows)
save_table(exclusion_violation_table, "exclusion_violation_stress_test")
display(exclusion_violation_table)
direct_instrument_effect theta_hat_oracle_iivm std_error_oracle_iivm ci_95_lower ci_95_upper true_late_without_direct_effect bias_vs_true_late
0 0.00 1.116161 0.064275 0.990182 1.242139 1.101928 0.014233
1 0.10 1.327639 0.064680 1.200866 1.454412 1.101928 0.225711
2 0.25 1.644857 0.066289 1.514931 1.774783 1.101928 0.542929
3 0.50 2.173553 0.071402 2.033606 2.313501 1.101928 1.071626

The estimate moves as the direct instrument effect grows. This is the central IV warning: a strong first stage does not protect against a bad exclusion restriction.

Exclusion-Violation Plot

The next figure visualizes how direct instrument effects bias the local-effect estimate.

fig, ax = plt.subplots(figsize=(8.5, 5))
sns.lineplot(
    data=exclusion_violation_table,
    x="direct_instrument_effect",
    y="theta_hat_oracle_iivm",
    marker="o",
    color="#b45309",
    ax=ax,
)
ax.fill_between(
    exclusion_violation_table["direct_instrument_effect"],
    exclusion_violation_table["ci_95_lower"],
    exclusion_violation_table["ci_95_upper"],
    color="#f59e0b",
    alpha=0.18,
)
ax.axhline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE without direct effect")
ax.set_title("Bias From Exclusion Restriction Violations")
ax.set_xlabel("Direct Effect Of Instrument On Outcome")
ax.set_ylabel("Estimated Local Effect")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_exclusion_violation_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()

This plot shows why instrument design has to be argued carefully. The model can estimate a ratio, but the ratio is causal only under the IV assumptions.

IIVM Compared With IRM

A tempting mistake is to fit a binary-treatment IRM model and treat it as equivalent to IIVM. They target different quantities.

This cell fits a gradient-boosted IRM ATE on the same observed treatment and controls. Because treatment take-up is confounded by compliance-related latent factors, the IRM estimate does not recover the ATE in this design.

irm_comparison = DoubleMLIRM(
    DoubleMLData(
        iivm_df[["weekly_value", "feature_exposure"] + feature_cols],
        y_col="weekly_value",
        d_cols="feature_exposure",
        x_cols=feature_cols,
    ),
    ml_g=clone(hgb_outcome_learner),
    ml_m=clone(hgb_treatment_learner),
    n_folds=5,
    n_rep=1,
    score="ATE",
)
irm_comparison.fit(store_predictions=True)

irm_vs_iivm = pd.concat(
    [
        irm_comparison.summary.reset_index().rename(columns={"index": "treatment"}).assign(model="IRM ATE on observed treatment", target_reference=TRUE_ATE),
        iivm_hgb.summary.reset_index().rename(columns={"index": "treatment"}).assign(model="IIVM LATE with instrument", target_reference=TRUE_LATE),
    ],
    ignore_index=True,
)
irm_vs_iivm["bias_vs_reference"] = irm_vs_iivm["coef"] - irm_vs_iivm["target_reference"]
save_table(irm_vs_iivm, "irm_vs_iivm_comparison")
display(irm_vs_iivm)
treatment coef std err t P>|t| 2.5 % 97.5 % model target_reference bias_vs_reference
0 feature_exposure 1.192215 0.039338 30.307254 9.199046e-202 1.115115 1.269315 IRM ATE on observed treatment 1.020535 0.171679
1 feature_exposure 1.206022 0.081519 14.794455 1.590685e-49 1.046249 1.365795 IIVM LATE with instrument 1.101928 0.104094

The IRM and IIVM estimates answer different questions. In this simulation, the IIVM design is the appropriate response to noncompliance and latent treatment selection.

When IIVM Is The Right Or Wrong Tool

IIVM is useful when:

  • treatment is binary;
  • instrument is binary;
  • noncompliance is central;
  • the instrument plausibly shifts treatment but does not directly affect outcome;
  • the desired target is a local effect for instrument-responsive units.

IIVM is not enough when:

  • the instrument is weak;
  • the instrument directly affects the outcome;
  • defiers are plausible and monotonicity is not defensible;
  • the desired target is the full-population ATE;
  • the treatment or the instrument is continuous rather than binary.

Reporting Checklist

A useful IIVM report should make the local-effect meaning impossible to miss.

The checklist below captures the design and diagnostics needed for binary IV reporting.

reporting_checklist = pd.DataFrame(
    [
        {"item": "Causal question", "status": "Estimate the local effect of feature_exposure on weekly_value for instrument-responsive units."},
        {"item": "Treatment and instrument", "status": "Both feature_exposure and encouragement are binary."},
        {"item": "Target", "status": "LATE, not full-population ATE."},
        {"item": "Instrument relevance", "status": "First-stage take-up diagnostics reported."},
        {"item": "Instrument overlap", "status": "Instrument propensity diagnostics reported."},
        {"item": "Exclusion", "status": "Valid by construction in the synthetic data; requires design evidence in real data."},
        {"item": "Conditional independence", "status": "Valid by construction after X in the synthetic data; requires design evidence in real data."},
        {"item": "Monotonicity", "status": "No defiers by construction; requires design evidence in real data."},
        {"item": "Nuisance learners", "status": "Compared linear/logistic and gradient-boosted nuisance sets."},
        {"item": "Cross-fitting", "status": "Manual ratio and DoubleMLIIVM use cross-fitted nuisance predictions."},
        {"item": "Uncertainty", "status": "Standard errors, confidence intervals, bootstrap interval, and split stability included."},
        {"item": "Stress tests", "status": "Weak first-stage and exclusion-violation stress tests included."},
    ]
)
save_table(reporting_checklist, "iivm_reporting_checklist")
display(reporting_checklist)
item status
0 Causal question Estimate the local effect of feature_exposure ...
1 Treatment and instrument Both feature_exposure and encouragement are bi...
2 Target LATE, not full-population ATE.
3 Instrument relevance First-stage take-up diagnostics reported.
4 Instrument overlap Instrument propensity diagnostics reported.
5 Exclusion Valid by construction in the synthetic data; r...
6 Conditional independence Valid by construction after X in the synthetic...
7 Monotonicity No defiers by construction; requires design ev...
8 Nuisance learners Compared linear/logistic and gradient-boosted ...
9 Cross-fitting Manual ratio and DoubleMLIIVM use cross-fitted...
10 Uncertainty Standard errors, confidence intervals, bootstr...
11 Stress tests Weak first-stage and exclusion-violation stres...

The checklist separates estimable diagnostics from assumptions. That separation matters even more for IV than for ordinary observed-control designs.

Report Template

The next cell writes a short markdown report template using the main gradient-boosted IIVM estimate.

This template keeps the local-effect target, first stage, instrument assumptions, and limitations close to the numeric estimate.

main_row = iivm_summary.loc[iivm_summary["estimator"] == "Gradient boosting nuisance IIVM"].iloc[0]
first_stage_mean = dml_first_stage_summary.loc[
    dml_first_stage_summary["diagnostic"] == "mean first stage r1_hat - r0_hat", "value"
].iloc[0]

report_text = f"""
# IIVM Local Effect Estimate Report Template

## Causal Question
Estimate the local effect of `feature_exposure` on `weekly_value` for units whose treatment status is shifted by `encouragement`.

## Target
The target is LATE, not the full-population ATE. The estimate applies to the instrument-responsive margin under the IV assumptions.

## Main Estimate
- Estimated local effect: {main_row['theta_hat']:.4f}
- Standard error: {main_row['std_error']:.4f}
- 95 percent confidence interval: [{main_row['ci_95_lower']:.4f}, {main_row['ci_95_upper']:.4f}]

## First Stage
- Mean estimated first stage: {first_stage_mean:.4f}

## Estimator
The main estimator is `DoubleMLIIVM` with five-fold cross-fitting, histogram gradient-boosted outcome nuisance models, and histogram gradient-boosted classifiers for instrument and treatment take-up nuisance models.

## Diagnostics Included
- Direct treatment comparisons, raw Wald, residualized Wald, oracle ratio, manual cross-fitted IIVM, and DoubleML IIVM comparisons.
- Compliance-type summary for the synthetic teaching data.
- Instrument overlap diagnostics.
- First-stage distribution diagnostics.
- Nuisance learner losses and prediction checks.
- Orthogonal score contribution checks.
- Bootstrap confidence interval.
- Repeated sample-splitting stability.
- Subgroup assumption variants.
- Weak first-stage and exclusion-violation stress tests.

## Required Assumptions
The local effect is causal only if the instrument is relevant, has no direct effect on the outcome, is conditionally independent after controls, satisfies monotonicity, and has adequate overlap. DoubleML estimates the score under these assumptions; it does not establish them.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_iivm_report_template.md"
report_path.write_text(report_text)
print(report_text)
# IIVM Local Effect Estimate Report Template

## Causal Question
Estimate the local effect of `feature_exposure` on `weekly_value` for units whose treatment status is shifted by `encouragement`.

## Target
The target is LATE, not the full-population ATE. The estimate applies to the instrument-responsive margin under the IV assumptions.

## Main Estimate
- Estimated local effect: 1.2060
- Standard error: 0.0815
- 95 percent confidence interval: [1.0462, 1.3658]

## First Stage
- Mean estimated first stage: 0.5091

## Estimator
The main estimator is `DoubleMLIIVM` with five-fold cross-fitting, histogram gradient-boosted outcome nuisance models, and histogram gradient-boosted classifiers for instrument and treatment take-up nuisance models.

## Diagnostics Included
- Direct treatment comparisons, raw Wald, residualized Wald, oracle ratio, manual cross-fitted IIVM, and DoubleML IIVM comparisons.
- Compliance-type summary for the synthetic teaching data.
- Instrument overlap diagnostics.
- First-stage distribution diagnostics.
- Nuisance learner losses and prediction checks.
- Orthogonal score contribution checks.
- Bootstrap confidence interval.
- Repeated sample-splitting stability.
- Subgroup assumption variants.
- Weak first-stage and exclusion-violation stress tests.

## Required Assumptions
The local effect is causal only if the instrument is relevant, has no direct effect on the outcome, is conditionally independent after controls, satisfies monotonicity, and has adequate overlap. DoubleML estimates the score under these assumptions; it does not establish them.

The report template is deliberately explicit about local-effect meaning. That is the most common place where binary IV results get overclaimed.

Artifact Manifest

The final cell lists every artifact produced by this notebook so the outputs are easy to find later.

artifact_rows = []
for folder, artifact_type in [
    (DATASET_DIR, "dataset"),
    (TABLE_DIR, "table"),
    (FIGURE_DIR, "figure"),
    (REPORT_DIR, "report"),
]:
    for file_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": file_path.name,
                "relative_path": str(file_path.relative_to(REPO_ROOT)),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
save_table(artifact_manifest, "artifact_manifest")
display(artifact_manifest)
artifact_type file_name relative_path
0 dataset 06_synthetic_iivm_data.csv notebooks/tutorials/doubleml/outputs/datasets/...
1 table 06_all_estimator_comparison.csv notebooks/tutorials/doubleml/outputs/tables/06...
2 table 06_artifact_manifest.csv notebooks/tutorials/doubleml/outputs/tables/06...
3 table 06_baseline_estimates.csv notebooks/tutorials/doubleml/outputs/tables/06...
4 table 06_compliance_type_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
5 table 06_data_audit.csv notebooks/tutorials/doubleml/outputs/tables/06...
6 table 06_design_correlation_matrix.csv notebooks/tutorials/doubleml/outputs/tables/06...
7 table 06_doubleml_first_stage_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
8 table 06_doubleml_iivm_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
9 table 06_doubleml_nuisance_losses.csv notebooks/tutorials/doubleml/outputs/tables/06...
10 table 06_doublemldata_backend_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
11 table 06_exclusion_violation_stress_test.csv notebooks/tutorials/doubleml/outputs/tables/06...
12 table 06_field_dictionary.csv notebooks/tutorials/doubleml/outputs/tables/06...
13 table 06_hgb_bootstrap_confidence_interval.csv notebooks/tutorials/doubleml/outputs/tables/06...
14 table 06_hgb_nuisance_quality_against_truth.csv notebooks/tutorials/doubleml/outputs/tables/06...
15 table 06_iivm_reporting_checklist.csv notebooks/tutorials/doubleml/outputs/tables/06...
16 table 06_instrument_overlap_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
17 table 06_irm_vs_iivm_comparison.csv notebooks/tutorials/doubleml/outputs/tables/06...
18 table 06_learner_catalog.csv notebooks/tutorials/doubleml/outputs/tables/06...
19 table 06_manual_crossfit_iivm_estimate.csv notebooks/tutorials/doubleml/outputs/tables/06...
20 table 06_manual_first_stage_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
21 table 06_manual_nuisance_quality.csv notebooks/tutorials/doubleml/outputs/tables/06...
22 table 06_orthogonal_score_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
23 table 06_package_versions.csv notebooks/tutorials/doubleml/outputs/tables/06...
24 table 06_repeated_sample_split_stability.csv notebooks/tutorials/doubleml/outputs/tables/06...
25 table 06_subgroup_assumption_comparison.csv notebooks/tutorials/doubleml/outputs/tables/06...
26 table 06_target_summary.csv notebooks/tutorials/doubleml/outputs/tables/06...
27 table 06_weak_instrument_stress_test.csv notebooks/tutorials/doubleml/outputs/tables/06...
28 figure 06_baseline_estimates.png notebooks/tutorials/doubleml/outputs/figures/0...
29 figure 06_compliance_and_take_up.png notebooks/tutorials/doubleml/outputs/figures/0...
30 figure 06_design_correlation_matrix.png notebooks/tutorials/doubleml/outputs/figures/0...
31 figure 06_doubleml_nuisance_prediction_quality.png notebooks/tutorials/doubleml/outputs/figures/0...
32 figure 06_exclusion_violation_stress_test.png notebooks/tutorials/doubleml/outputs/figures/0...
33 figure 06_first_stage_and_instrument_overlap.png notebooks/tutorials/doubleml/outputs/figures/0...
34 figure 06_iivm_design_dag.png notebooks/tutorials/doubleml/outputs/figures/0...
35 figure 06_iivm_estimate_comparison.png notebooks/tutorials/doubleml/outputs/figures/0...
36 figure 06_instrument_overlap.png notebooks/tutorials/doubleml/outputs/figures/0...
37 figure 06_manual_first_stage_distribution.png notebooks/tutorials/doubleml/outputs/figures/0...
38 figure 06_orthogonal_score_distribution.png notebooks/tutorials/doubleml/outputs/figures/0...
39 figure 06_repeated_split_stability.png notebooks/tutorials/doubleml/outputs/figures/0...
40 figure 06_weak_instrument_stress_test.png notebooks/tutorials/doubleml/outputs/figures/0...
41 report 06_iivm_report_template.md notebooks/tutorials/doubleml/outputs/reports/0...

The IIVM notebook is complete. The next natural topic is difference-in-differences, where the identifying assumption shifts from observed-control adjustment or instruments to parallel trends.