DoubleML Tutorial 05: Interactive Regression Model (IRM)

This notebook is a full tutorial on DoubleMLIRM, the interactive regression model for binary treatments.

The previous notebooks handled continuous treatments. Here the treatment is binary: a unit either receives the intervention or does not. That changes the estimand, the nuisance functions, and the diagnostics. Instead of residualizing a continuous treatment, IRM combines outcome regressions and a propensity score to estimate ATE and ATT-style targets.

The goal is to understand the full workflow: potential-outcome estimands, identification assumptions, propensity overlap, manual AIPW estimation, DoubleML fitting, nuisance diagnostics, and honest reporting.

Learning Goals

By the end of this notebook, you should be able to:

  • Explain the IRM setup for binary treatments.
  • Define ATE and ATTE in potential-outcome notation.
  • State the identification assumptions: consistency, unconfoundedness, overlap, and no interference.
  • Explain the roles of ml_g and ml_m in DoubleMLIRM.
  • Manually compute cross-fitted AIPW estimates for ATE and ATT-style targets.
  • Fit DoubleMLIRM for score="ATE" and score="ATTE".
  • Diagnose propensity overlap, inverse-propensity weights, nuisance losses, and score contributions.
  • Understand how weak overlap can inflate uncertainty.

Binary Treatments And Potential Outcomes

For a binary treatment, each unit has two potential outcomes:

\[ Y(1) \quad \text{and} \quad Y(0). \]

Y(1) is the outcome the unit would have under treatment. Y(0) is the outcome the same unit would have without treatment. We observe only one of them:

\[ Y = D Y(1) + (1-D)Y(0), \]

where D = 1 means treated and D = 0 means untreated.

Two common targets are:

\[ \mathrm{ATE} = E[Y(1) - Y(0)] \]

and

\[ \mathrm{ATT} = E[Y(1) - Y(0) \mid D=1]. \]

DoubleML names the treated-target score ATTE. In this notebook we use ATTE when referring to the DoubleML score option and ATT-style language when explaining the estimand.
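Before any estimation machinery, the estimands are easy to state in code. Below is a minimal sketch on four made-up units: the observation rule picks one potential outcome per unit, while the ATE averages the individual effects over everyone and the ATT averages them over the treated only. All numbers are invented for illustration.

import numpy as np

y1 = np.array([3.0, 2.0, 5.0, 4.0])      # potential outcomes Y(1)
y0 = np.array([1.0, 2.0, 2.0, 1.0])      # potential outcomes Y(0)
d = np.array([1, 0, 1, 0])               # treatment indicator D

y_obs = d * y1 + (1 - d) * y0            # the only outcome we ever observe

ate = float(np.mean(y1 - y0))            # averages to 2.0 here
att = float(np.mean((y1 - y0)[d == 1]))  # averages to 2.5 here
print(y_obs, ate, att)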

Identification Assumptions

IRM does not rely on an instrument or a randomized experiment. It estimates causal effects from observational data, and only under assumptions.

The standard observed-control identification assumptions are:

  • Consistency: the observed outcome equals the potential outcome under the treatment actually received.
  • Conditional unconfoundedness: after controlling for X, treatment assignment is independent of potential outcomes.
  • Overlap or positivity: every covariate profile has a nonzero probability of receiving each treatment state.
  • No interference: one unit’s treatment does not change another unit’s potential outcomes.

DoubleML helps with flexible nuisance estimation and orthogonal scores. It does not make these assumptions true.

IRM Nuisance Functions

DoubleMLIRM uses two main nuisance components:

\[ g_0(0, X) = E[Y \mid D=0, X] \]

\[ g_0(1, X) = E[Y \mid D=1, X] \]

and

\[ m_0(X) = P(D=1 \mid X). \]

In the API:

  • ml_g estimates the outcome regression. DoubleML clones it internally and fits separate models on treated and untreated rows, yielding predictions for g(1, X) and g(0, X).
  • ml_m estimates the propensity score, so it must be a classifier with predict_proba().

This is different from PLR and PLIV. In IRM, the treatment is binary and the propensity score is central.
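A minimal construction sketch makes the learner contract concrete. It assumes DoubleML's built-in make_irm_data generator; the learner choices are placeholders, and the notebook builds its own dataset and learners below.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from doubleml import DoubleMLIRM
from doubleml.datasets import make_irm_data

toy_data = make_irm_data(n_obs=500, dim_x=10, return_type="DoubleMLData")
toy_irm = DoubleMLIRM(
    toy_data,
    ml_g=RandomForestRegressor(n_estimators=100, random_state=0),   # outcome regression
    ml_m=RandomForestClassifier(n_estimators=100, random_state=0),  # needs predict_proba()
    score="ATE",
)
toy_irm.fit()
print(toy_irm.summary)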

The AIPW Score For ATE

The ATE score combines regression adjustment and inverse-propensity weighting:

\[ \psi_i = \hat g_1(X_i) - \hat g_0(X_i) + \frac{D_i(Y_i - \hat g_1(X_i))}{\hat m(X_i)} - \frac{(1-D_i)(Y_i - \hat g_0(X_i))}{1-\hat m(X_i)}. \]

The estimate is the average of this score. This is the augmented inverse-propensity weighting (AIPW) estimator.

The score is doubly robust in the usual sense: under suitable conditions, the estimate can remain consistent if either the propensity model or the outcome model is correctly specified. DoubleML adds cross-fitting and orthogonalization to make flexible machine-learning nuisance estimation more reliable.
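Double robustness can be seen directly in a toy simulation, sketched below with made-up nuisance functions: the AIPW average stays near the true effect of 1.0 when either nuisance model is replaced by a deliberately wrong one, as long as the other is correct.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
m_true = 1.0 / (1.0 + np.exp(-x))          # true propensity P(D=1 | X)
d = rng.binomial(1, m_true)
g0_true, g1_true = x, x + 1.0              # true outcome regressions, ATE = 1
y = np.where(d == 1, g1_true, g0_true) + rng.normal(size=n)

def aipw_mean(g0, g1, m):
    return np.mean(g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m))

print(aipw_mean(g0_true, g1_true, np.full(n, 0.5)))  # wrong propensity, still near 1
print(aipw_mean(np.zeros(n), np.zeros(n), m_true))   # wrong outcome model, still near 1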

Runtime Note

This notebook fits cross-fitted binary-treatment models, two DoubleML scores, and a repeated-splitting check. On a typical laptop, the full notebook should take roughly two to four minutes.

The examples use a synthetic dataset large enough to show overlap and heterogeneity, but small enough to rerun comfortably.

Setup

This cell prepares the notebook environment. It creates output folders, makes matplotlib cache writes local to the tutorial folder, imports scientific Python libraries, and records package versions.

The path logic supports running from the repository root or directly from the tutorial folder.

from pathlib import Path
import os
import warnings

NOTEBOOK_PREFIX = "05"
RANDOM_STATE = 2026

CURRENT_DIR = Path.cwd().resolve()
if CURRENT_DIR.name == "doubleml":
    # Running from inside the tutorial folder.
    REPO_ROOT = CURRENT_DIR.parents[2]
else:
    # Running from the repository root (or any other directory).
    REPO_ROOT = CURRENT_DIR

OUTPUT_DIR = REPO_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message=".*X does not have valid feature names.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegressionCV
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score, r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

import doubleml as dml
from doubleml import DoubleMLData, DoubleMLIRM

warnings.filterwarnings("ignore", category=ConvergenceWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["axes.titleweight"] = "bold"

package_versions = pd.DataFrame(
    [
        {"package": "doubleml", "version": dml.__version__},
        {"package": "numpy", "version": np.__version__},
        {"package": "pandas", "version": pd.__version__},
        {"package": "seaborn", "version": sns.__version__},
        {"package": "statsmodels", "version": sm.__version__},
    ]
)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 doubleml 0.11.2
1 numpy 2.4.4
2 pandas 3.0.2
3 seaborn 0.13.2
4 statsmodels 0.14.6

The package versions are part of the reproducibility record. Cross-fitting, tree learners, and classifier calibration can shift slightly across package versions.

Helper Functions

The helper functions below keep the analysis cells focused on the causal workflow.

They handle table saving, sigmoid transformation, baseline summaries, AIPW calculations, cross-fitted nuisance predictions, effective sample size, and DoubleML prediction extraction.

def save_table(df, name):
    path = TABLE_DIR / f"{NOTEBOOK_PREFIX}_{name}.csv"
    df.to_csv(path, index=False)
    return df


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def treatment_ols_summary(y, X, treatment_col, label):
    X_design = sm.add_constant(X, has_constant="add")
    fit = sm.OLS(y, X_design).fit(cov_type="HC1")
    row = fit.summary2().tables[1].loc[treatment_col]
    return {
        "estimator": label,
        "target": "raw regression slope",
        "theta_hat": float(row["Coef."]),
        "std_error": float(row["Std.Err."]),
        "ci_95_lower": float(row["[0.025"]),
        "ci_95_upper": float(row["0.975]"]),
        "p_value": float(row["P>|z|"]),
    }


def mean_difference_summary(y, d, label):
    treated = y[d == 1]
    untreated = y[d == 0]
    theta_hat = float(treated.mean() - untreated.mean())
    std_error = float(np.sqrt(treated.var(ddof=1) / len(treated) + untreated.var(ddof=1) / len(untreated)))
    return {
        "estimator": label,
        "target": "unadjusted mean difference",
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
    }


def aipw_ate(y, d, g0_hat, g1_hat, m_hat, label):
    m_hat = np.clip(np.asarray(m_hat), 0.02, 0.98)
    score = g1_hat - g0_hat + d * (y - g1_hat) / m_hat - (1 - d) * (y - g0_hat) / (1 - m_hat)
    theta_hat = float(np.mean(score))
    std_error = float(np.std(score, ddof=1) / np.sqrt(len(score)))
    return {
        "estimator": label,
        "target": "ATE",
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
    }


def aipw_att(y, d, g0_hat, m_hat, label):
    m_hat = np.clip(np.asarray(m_hat), 0.02, 0.98)
    p_treated = float(np.mean(d))
    score = d * (y - g0_hat) / p_treated - (1 - d) * m_hat * (y - g0_hat) / ((1 - m_hat) * p_treated)
    theta_hat = float(np.mean(score))
    std_error = float(np.std(score, ddof=1) / np.sqrt(len(score)))
    return {
        "estimator": label,
        "target": "ATT-style",
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
    }


def ipw_ate(y, d, m_hat, label):
    m_hat = np.clip(np.asarray(m_hat), 0.02, 0.98)
    score = d * y / m_hat - (1 - d) * y / (1 - m_hat)
    theta_hat = float(np.mean(score))
    std_error = float(np.std(score, ddof=1) / np.sqrt(len(score)))
    return {
        "estimator": label,
        "target": "ATE",
        "theta_hat": theta_hat,
        "std_error": std_error,
        "ci_95_lower": theta_hat - 1.96 * std_error,
        "ci_95_upper": theta_hat + 1.96 * std_error,
        "p_value": np.nan,
    }


def weighted_effective_sample_size(weights):
    weights = np.asarray(weights)
    return float((weights.sum() ** 2) / np.sum(weights ** 2))


def crossfit_irm_nuisances(X, y, d, outcome_learner, propensity_learner, n_splits=5, random_state=RANDOM_STATE):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    g0_hat = np.zeros(len(y))
    g1_hat = np.zeros(len(y))
    m_hat = np.zeros(len(y))

    for train_idx, test_idx in cv.split(X):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        y_train = y[train_idx]
        d_train = d[train_idx]

        g0_model = clone(outcome_learner).fit(X_train[d_train == 0], y_train[d_train == 0])
        g1_model = clone(outcome_learner).fit(X_train[d_train == 1], y_train[d_train == 1])
        m_model = clone(propensity_learner).fit(X_train, d_train)

        g0_hat[test_idx] = g0_model.predict(X_test)
        g1_hat[test_idx] = g1_model.predict(X_test)
        m_hat[test_idx] = m_model.predict_proba(X_test)[:, 1]

    return g0_hat, g1_hat, np.clip(m_hat, 0.02, 0.98)


def prediction_vector(doubleml_model, learner_key):
    arr = np.asarray(doubleml_model.predictions[learner_key])
    if arr.ndim != 3:
        raise ValueError(f"Expected a 3D prediction array, got shape {arr.shape}")
    return arr[:, 0, 0]


def learner_loss_table(doubleml_model, model_label):
    losses = doubleml_model.evaluate_learners()
    rows = []
    for learner_name, values in losses.items():
        arr = np.asarray(values)
        rows.append(
            {
                "model": model_label,
                "learner": learner_name,
                "mean_loss": float(np.mean(arr)),
                "min_loss": float(np.min(arr)),
                "max_loss": float(np.max(arr)),
            }
        )
    return pd.DataFrame(rows)

The manual AIPW formulas are included so the notebook does not feel like magic. DoubleML automates these ideas with careful score construction, sample splitting, repeated splitting, bootstrap tools, and consistent model objects.

Draw The IRM Design

The diagram below shows the observed-control design.

Observed covariates X affect both treatment assignment and potential outcomes. The binary treatment D affects the observed outcome Y. Under unconfoundedness, adjusting for X is enough to compare treated and untreated potential outcomes.

The dashed box reminds us that unobserved confounding is a design threat. If a hidden variable affects both treatment and outcome after conditioning on X, IRM is not enough.

from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

nodes = {
    "X": {"xy": (0.16, 0.62), "label": "Observed controls\nX", "color": "#dbeafe"},
    "D": {"xy": (0.50, 0.62), "label": "Binary treatment\nD", "color": "#fef3c7"},
    "Y": {"xy": (0.84, 0.62), "label": "Observed outcome\nY", "color": "#fee2e2"},
    "U": {"xy": (0.50, 0.26), "label": "Hidden confounding\nrisk", "color": "#f3f4f6"},
}

edge_specs = [
    ("X", "D", "#334155", "solid", 0.00),
    ("X", "Y", "#334155", "solid", -0.12),
    ("D", "Y", "#b45309", "solid", 0.00),
    ("U", "D", "#6b7280", "dashed", 0.10),
    ("U", "Y", "#6b7280", "dashed", -0.10),
]

fig, ax = plt.subplots(figsize=(12, 5.6))
ax.set_axis_off()
box_w, box_h = 0.17, 0.115

def edge_endpoint(source_xy, target_xy, from_source=True):
    x0, y0 = source_xy
    x1, y1 = target_xy
    dx, dy = x1 - x0, y1 - y0
    scale = 1.0 / max(abs(dx) / (box_w / 2), abs(dy) / (box_h / 2))
    if from_source:
        return (x0 + dx * scale * 1.08, y0 + dy * scale * 1.08)
    return (x1 - dx * scale * 1.12, y1 - dy * scale * 1.12)

for spec in nodes.values():
    x, y_pos = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y_pos - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.02",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y_pos, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

for start, end, color, style, rad in edge_specs:
    start_xy = edge_endpoint(nodes[start]["xy"], nodes[end]["xy"], from_source=True)
    end_xy = edge_endpoint(nodes[start]["xy"], nodes[end]["xy"], from_source=False)
    arrow = FancyArrowPatch(
        start_xy,
        end_xy,
        arrowstyle="-|>",
        mutation_scale=20,
        linewidth=1.8,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        zorder=5,
    )
    ax.add_patch(arrow)

ax.text(
    0.50,
    0.08,
    "IRM uses observed controls to compare treated and untreated potential outcomes; hidden common causes remain a design risk.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Interactive Regression Model Teaching Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_irm_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The diagram has no instrument. It is an observed-control design. That means the covariates must be rich enough to make treated and untreated units comparable after adjustment.

Create A Teaching Dataset

We now simulate a binary-treatment dataset.

The treatment probability depends on observed controls, so the treated and untreated groups are not directly comparable. However, the treatment assignment is generated from observed controls only, so unconfoundedness holds by construction.

The treatment effect is heterogeneous: some units benefit more than others. This makes the ATE and ATT-style targets different.

rng = np.random.default_rng(RANDOM_STATE)
N_OBS = 4_000

feature_cols = [
    "engagement_score",
    "need_intensity",
    "content_fit",
    "recent_activity",
    "price_sensitivity",
    "tenure_signal",
    "novelty_appetite",
    "seasonality_signal",
]

X_raw = pd.DataFrame(rng.normal(size=(N_OBS, len(feature_cols))), columns=feature_cols)

true_logit = (
    -0.15
    + 0.65 * X_raw["engagement_score"]
    + 0.45 * X_raw["need_intensity"]
    - 0.50 * X_raw["price_sensitivity"]
    + 0.35 * np.sin(X_raw["content_fit"])
    + 0.25 * X_raw["recent_activity"] * X_raw["novelty_appetite"]
    - 0.20 * X_raw["seasonality_signal"]
)
true_propensity = np.clip(sigmoid(true_logit), 0.08, 0.92)
feature_exposure = rng.binomial(1, true_propensity, size=N_OBS)

true_mu0 = (
    0.80 * np.sin(X_raw["engagement_score"])
    + 0.55 * X_raw["need_intensity"] ** 2
    - 0.45 * X_raw["price_sensitivity"]
    + 0.35 * X_raw["content_fit"] * X_raw["recent_activity"]
    + 0.30 * np.cos(X_raw["seasonality_signal"])
)

true_tau = (
    0.75
    + 0.35 * sigmoid(X_raw["engagement_score"])
    + 0.30 * X_raw["need_intensity"]
    - 0.20 * X_raw["price_sensitivity"]
    + 0.15 * (X_raw["novelty_appetite"] > 0).astype(float)
)
true_mu1 = true_mu0 + true_tau

shared_noise = rng.normal(loc=0.0, scale=1.0, size=N_OBS)
potential_outcome_0 = true_mu0 + shared_noise
potential_outcome_1 = true_mu1 + shared_noise
weekly_value = np.where(feature_exposure == 1, potential_outcome_1, potential_outcome_0)

irm_df = X_raw.assign(
    feature_exposure=feature_exposure,
    weekly_value=weekly_value,
    true_propensity=true_propensity,
    true_mu0=true_mu0,
    true_mu1=true_mu1,
    true_tau=true_tau,
    potential_outcome_0=potential_outcome_0,
    potential_outcome_1=potential_outcome_1,
)

TRUE_ATE = float(irm_df["true_tau"].mean())
TRUE_ATT = float(irm_df.loc[irm_df["feature_exposure"] == 1, "true_tau"].mean())
TREATMENT_RATE = float(irm_df["feature_exposure"].mean())

irm_df.to_csv(DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_irm_data.csv", index=False)
print(f"Saved synthetic IRM data with shape {irm_df.shape}")
print(f"Treatment rate: {TREATMENT_RATE:.3f}")
print(f"True ATE: {TRUE_ATE:.3f}")
print(f"True ATT-style target: {TRUE_ATT:.3f}")
display(irm_df.head())
Saved synthetic IRM data with shape (4000, 16)
Treatment rate: 0.459
True ATE: 0.996
True ATT-style target: 1.133
engagement_score need_intensity content_fit recent_activity price_sensitivity tenure_signal novelty_appetite seasonality_signal feature_exposure weekly_value true_propensity true_mu0 true_mu1 true_tau potential_outcome_0 potential_outcome_1
0 -0.793122 0.240571 -1.896326 1.395772 0.638295 -0.292047 -0.311949 0.303835 0 -1.008348 0.201385 -1.465574 -0.662037 0.803537 -1.008348 -0.204811
1 -0.267660 -0.225909 0.720068 0.514705 -0.064128 -0.085477 0.160916 -0.614018 0 -0.733904 0.495185 0.220266 1.217037 0.996771 -0.733904 0.262868
2 -0.403750 0.548260 -0.130483 -1.374426 -0.477279 0.656622 -0.232283 -0.148733 0 -0.257925 0.534182 0.425260 1.575338 1.150078 -0.257925 0.892153
3 0.641837 1.824610 -0.713189 1.348207 -1.230013 0.174978 -1.169530 1.351458 1 3.904644 0.692068 2.592242 4.364937 1.772695 2.131949 3.904644
4 0.833923 1.137717 -0.885533 0.684555 -0.519013 -0.457385 0.506537 0.876718 1 1.742986 0.690805 1.517669 3.106801 1.589132 0.153854 1.742986

The true ATE and ATT-style target differ because units with larger expected treatment effects are more likely to be treated. That is common in observational settings: who gets treated matters.
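A one-line check, sketched below, makes this concrete: in this simulation the true individual effect and the true propensity are positively correlated, which is exactly why the treated average sits above the population average.

effect_selection_corr = float(irm_df[["true_tau", "true_propensity"]].corr().iloc[0, 1])
print(f"corr(true_tau, true_propensity) = {effect_selection_corr:.3f}")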

Field Dictionary

This table documents the role of the important columns. In real work, this step prevents a common mistake: placing outcome-derived, treatment-derived, or post-treatment columns into the controls.

field_dictionary = pd.DataFrame(
    [
        {"column": "weekly_value", "role": "outcome", "description": "Observed outcome Y."},
        {"column": "feature_exposure", "role": "binary treatment", "description": "Observed binary treatment D."},
        *[
            {"column": col, "role": "control", "description": "Observed pre-treatment control X."}
            for col in feature_cols
        ],
        {"column": "true_propensity", "role": "hidden teaching column", "description": "True treatment probability P(D=1 | X)."},
        {"column": "true_mu0", "role": "hidden teaching column", "description": "True E[Y(0) | X]."},
        {"column": "true_mu1", "role": "hidden teaching column", "description": "True E[Y(1) | X]."},
        {"column": "true_tau", "role": "hidden teaching column", "description": "True individual treatment effect mu1(X) - mu0(X)."},
        {"column": "potential_outcome_0", "role": "hidden teaching column", "description": "Simulated Y(0), not observed in real data for treated units."},
        {"column": "potential_outcome_1", "role": "hidden teaching column", "description": "Simulated Y(1), not observed in real data for untreated units."},
    ]
)
save_table(field_dictionary, "field_dictionary")
display(field_dictionary)
column role description
0 weekly_value outcome Observed outcome Y.
1 feature_exposure binary treatment Observed binary treatment D.
2 engagement_score control Observed pre-treatment control X.
3 need_intensity control Observed pre-treatment control X.
4 content_fit control Observed pre-treatment control X.
5 recent_activity control Observed pre-treatment control X.
6 price_sensitivity control Observed pre-treatment control X.
7 tenure_signal control Observed pre-treatment control X.
8 novelty_appetite control Observed pre-treatment control X.
9 seasonality_signal control Observed pre-treatment control X.
10 true_propensity hidden teaching column True treatment probability P(D=1 | X).
11 true_mu0 hidden teaching column True E[Y(0) | X].
12 true_mu1 hidden teaching column True E[Y(1) | X].
13 true_tau hidden teaching column True individual treatment effect mu1(X) - mu0(X).
14 potential_outcome_0 hidden teaching column Simulated Y(0), not observed in real data for ...
15 potential_outcome_1 hidden teaching column Simulated Y(1), not observed in real data for ...

The controls are all pre-treatment by construction. The true potential outcome columns are excluded from every model and used only to check synthetic truth.

Basic Data Audit

Before estimating effects, we check missingness, scale, treatment support, and the true targets in the synthetic data.

In binary-treatment problems, the treated and untreated sample sizes matter. If one group is tiny, nuisance learning and overlap diagnostics become fragile.

audit_cols = ["weekly_value", "feature_exposure", "true_propensity", "true_tau"] + feature_cols

data_audit = pd.DataFrame(
    [
        {
            "column": column,
            "missing_rate": float(irm_df[column].isna().mean()),
            "mean": float(irm_df[column].mean()),
            "std": float(irm_df[column].std()),
            "min": float(irm_df[column].min()),
            "p05": float(irm_df[column].quantile(0.05)),
            "median": float(irm_df[column].median()),
            "p95": float(irm_df[column].quantile(0.95)),
            "max": float(irm_df[column].max()),
            "n_unique": int(irm_df[column].nunique()),
        }
        for column in audit_cols
    ]
)
save_table(data_audit, "data_audit")
display(data_audit)

target_summary = pd.DataFrame(
    [
        {"target": "Treatment rate", "value": TREATMENT_RATE},
        {"target": "True ATE", "value": TRUE_ATE},
        {"target": "True ATT-style target", "value": TRUE_ATT},
        {"target": "Treated count", "value": int(irm_df["feature_exposure"].sum())},
        {"target": "Untreated count", "value": int((1 - irm_df["feature_exposure"]).sum())},
    ]
)
save_table(target_summary, "target_summary")
display(target_summary)
column missing_rate mean std min p05 median p95 max n_unique
0 weekly_value 0.0 1.257400 1.713478 -3.970586 -1.356658 1.157832 4.236847 8.540294 4000
1 feature_exposure 0.0 0.459250 0.498399 0.000000 0.000000 0.000000 1.000000 1.000000 2
2 true_propensity 0.0 0.469063 0.208874 0.080000 0.141123 0.464328 0.819851 0.920000 3932
3 true_tau 0.0 0.996435 0.379393 -0.434153 0.376586 0.996509 1.619226 2.220412 4000
4 engagement_score 0.0 -0.017138 1.004586 -3.659392 -1.685178 -0.013643 1.577029 4.013439 4000
5 need_intensity 0.0 -0.008217 1.008047 -3.635159 -1.657538 -0.003896 1.658798 3.527216 4000
6 content_fit 0.0 0.011039 0.991930 -3.987716 -1.628396 0.014778 1.631440 3.600337 4000
7 recent_activity 0.0 -0.001082 0.998163 -3.573864 -1.594612 0.004866 1.673488 3.653308 4000
8 price_sensitivity 0.0 -0.014797 0.991680 -3.747414 -1.673546 0.006008 1.596217 3.417551 4000
9 tenure_signal 0.0 0.033066 1.014382 -4.012688 -1.662613 0.046687 1.704032 3.667511 4000
10 novelty_appetite 0.0 -0.015651 0.981362 -4.044650 -1.593314 -0.048284 1.624298 3.230977 4000
11 seasonality_signal 0.0 0.007699 1.017812 -3.372559 -1.640317 0.013726 1.687673 3.620951 4000
target value
0 Treatment rate 0.459250
1 True ATE 0.996435
2 True ATT-style target 1.132978
3 Treated count 1837.000000
4 Untreated count 2163.000000

The treatment rate is comfortably away from zero and one. That does not guarantee overlap everywhere, but it tells us both treatment groups have enough rows for the tutorial.

Treatment Assignment Is Confounded By X

This figure shows that the true propensity score varies with observed controls. Treated units are not a random subset of all units.

The right panel shows the distribution of the true treatment effect. Since treatment effects vary across units, the ATE and ATT-style target need not match.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.histplot(
    data=irm_df,
    x="true_propensity",
    hue="feature_exposure",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.35,
    ax=axes[0],
)
axes[0].set_title("True Propensity Score By Treatment Status")
axes[0].set_xlabel("True P(D=1 | X)")
axes[0].set_ylabel("Density")

sns.histplot(
    data=irm_df,
    x="true_tau",
    hue="feature_exposure",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.35,
    ax=axes[1],
)
axes[1].axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE")
axes[1].axvline(TRUE_ATT, color="#b91c1c", linestyle="--", linewidth=1.5, label="True ATT-style")
axes[1].set_title("Treatment Effect Heterogeneity")
axes[1].set_xlabel("True Individual Effect")
axes[1].legend()

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_propensity_and_effect_heterogeneity.png", dpi=160, bbox_inches="tight")
plt.show()

The treated group is shifted toward higher propensities and somewhat different effects. That is exactly why a raw mean difference can miss the ATE.

Correlation Screen

A correlation matrix is only a rough diagnostic, especially with nonlinear relationships. Still, it helps us see how treatment, outcome, true propensity, true effects, and controls relate in the synthetic data.

corr_cols = ["weekly_value", "feature_exposure", "true_propensity", "true_tau", "true_mu0", "true_mu1"] + feature_cols
corr_matrix = irm_df[corr_cols].corr()
save_table(corr_matrix.reset_index().rename(columns={"index": "column"}), "design_correlation_matrix")

fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(
    corr_matrix,
    cmap="vlag",
    center=0,
    annot=False,
    linewidths=0.4,
    cbar_kws={"label": "Correlation"},
    ax=ax,
)
ax.set_title("Correlation Screen For IRM Teaching Data")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_design_correlation_matrix.png", dpi=160, bbox_inches="tight")
plt.show()

The matrix shows why a simple comparison is risky: treatment is associated with observed controls that are also related to the outcome.

Baseline Estimators

We start with simple baselines:

  • Difference in means between treated and untreated units.
  • OLS adjustment with treatment and raw controls.
  • Oracle regression adjustment using the true outcome nuisance functions.
  • Oracle AIPW using the true propensity and true outcome nuisance functions.

The oracle rows are available only because the data are synthetic. They show the target behavior when nuisance functions are known.

y = irm_df["weekly_value"].to_numpy()
d = irm_df["feature_exposure"].to_numpy()
X = irm_df[feature_cols]
true_p = irm_df["true_propensity"].to_numpy()
true_g0 = irm_df["true_mu0"].to_numpy()
true_g1 = irm_df["true_mu1"].to_numpy()

baseline_rows = []
baseline_rows.append(mean_difference_summary(y, d, "Difference in means"))
baseline_rows.append(treatment_ols_summary(irm_df["weekly_value"], irm_df[["feature_exposure"] + feature_cols], "feature_exposure", "OLS with raw controls"))
baseline_rows.append(
    {
        "estimator": "Oracle regression adjustment",
        "target": "ATE",
        "theta_hat": float(np.mean(true_g1 - true_g0)),
        "std_error": float(np.std(true_g1 - true_g0, ddof=1) / np.sqrt(len(true_g1))),
        "ci_95_lower": float(np.mean(true_g1 - true_g0) - 1.96 * np.std(true_g1 - true_g0, ddof=1) / np.sqrt(len(true_g1))),
        "ci_95_upper": float(np.mean(true_g1 - true_g0) + 1.96 * np.std(true_g1 - true_g0, ddof=1) / np.sqrt(len(true_g1))),
        "p_value": np.nan,
    }
)
baseline_rows.append(aipw_ate(y, d, true_g0, true_g1, true_p, "Oracle AIPW"))
baseline_rows.append(aipw_att(y, d, true_g0, true_p, "Oracle ATT-style AIPW"))

baseline_estimates = pd.DataFrame(baseline_rows)
baseline_estimates["true_target"] = np.where(baseline_estimates["target"].eq("ATT-style"), TRUE_ATT, TRUE_ATE)
baseline_estimates["bias_vs_target"] = baseline_estimates["theta_hat"] - baseline_estimates["true_target"]
save_table(baseline_estimates, "baseline_estimates")
display(baseline_estimates)
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value true_target bias_vs_target
0 Difference in means unadjusted mean difference 1.595003 0.048559 1.499827 1.690179 NaN 0.996435 0.598568
1 OLS with raw controls raw regression slope 1.045854 0.046136 0.955429 1.136279 9.075105e-114 0.996435 0.049419
2 Oracle regression adjustment ATE 0.996435 0.005999 0.984677 1.008192 NaN 0.996435 0.000000
3 Oracle AIPW ATE 0.945411 0.037036 0.872821 1.018001 NaN 0.996435 -0.051024
4 Oracle ATT-style AIPW ATT-style 1.119326 0.045365 1.030411 1.208241 NaN 1.132978 -0.013652

The raw mean difference is far from the ATE because treatment assignment depends on covariates. Adjustment moves the estimate toward the causal target, and the oracle rows show what happens when the nuisance functions are known.

Baseline Estimate Plot

This figure shows the baseline estimates. The blue dashed line marks the true ATE. The red dashed line marks the true ATT-style target.

The different reference lines matter because ATE and ATT-style targets are different in this dataset.

plot_baselines = baseline_estimates.copy().sort_values("theta_hat")
plot_baselines["lower_error"] = plot_baselines["theta_hat"] - plot_baselines["ci_95_lower"]
plot_baselines["upper_error"] = plot_baselines["ci_95_upper"] - plot_baselines["theta_hat"]

fig, ax = plt.subplots(figsize=(10.8, 5.8))
y_pos = np.arange(len(plot_baselines))
ax.errorbar(
    x=plot_baselines["theta_hat"],
    y=y_pos,
    xerr=[plot_baselines["lower_error"], plot_baselines["upper_error"]],
    fmt="o",
    color="#0f172a",
    ecolor="#475569",
    capsize=4,
)
ax.axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE")
ax.axvline(TRUE_ATT, color="#b91c1c", linestyle="--", linewidth=1.5, label="True ATT-style")
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_baselines["estimator"])
ax.set_xlabel("Estimated Effect")
ax.set_title("Baseline Binary-Treatment Estimates")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_baseline_estimates.png", dpi=160, bbox_inches="tight")
plt.show()

The plot makes the estimand distinction visible. A raw difference can be biased, and an ATT-style target can legitimately differ from the ATE when treatment effects vary.

Nuisance Learners

We will compare two nuisance specifications:

  • regularized linear/logistic learners;
  • histogram gradient boosting learners.

ml_g is a regressor because the outcome is continuous. ml_m is a classifier because it estimates the propensity score.

linear_outcome_learner = make_pipeline(
    StandardScaler(),
    LassoCV(cv=3, random_state=RANDOM_STATE, max_iter=5_000),
)
linear_propensity_learner = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        cv=3,
        max_iter=2_000,
        scoring="neg_log_loss",
        random_state=RANDOM_STATE,
    ),
)

hgb_outcome_learner = HistGradientBoostingRegressor(
    max_iter=240,
    learning_rate=0.04,
    max_leaf_nodes=24,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
hgb_propensity_learner = HistGradientBoostingClassifier(
    max_iter=240,
    learning_rate=0.04,
    max_leaf_nodes=24,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)

learner_catalog = pd.DataFrame(
    [
        {"learner_name": "Regularized linear outcome + logistic propensity", "use_case": "Transparent baseline; may underfit nonlinear nuisance functions."},
        {"learner_name": "Gradient-boosted outcome + propensity", "use_case": "Captures nonlinearities and interactions in this synthetic design."},
    ]
)
save_table(learner_catalog, "learner_catalog")
display(learner_catalog)
learner_name use_case
0 Regularized linear outcome + logistic propensity Transparent baseline; may underfit nonlinear n...
1 Gradient-boosted outcome + propensity Captures nonlinearities and interactions in th...

The nuisance learners are not the final estimand. They are tools for estimating the outcome regressions and propensity score used in the orthogonal score.

Manual Cross-Fitted AIPW

Before fitting DoubleML, we manually compute cross-fitted AIPW estimates.

The manual procedure estimates three nuisance quantities out of fold:

  • g0_hat(X): predicted outcome if untreated;
  • g1_hat(X): predicted outcome if treated;
  • m_hat(X): predicted probability of treatment.

Each row’s nuisance predictions come from models that did not train on that row.

manual_g0_hat, manual_g1_hat, manual_m_hat = crossfit_irm_nuisances(
    X=X,
    y=y,
    d=d,
    outcome_learner=hgb_outcome_learner,
    propensity_learner=hgb_propensity_learner,
    n_splits=5,
    random_state=RANDOM_STATE,
)

manual_estimates = pd.DataFrame(
    [
        ipw_ate(y, d, manual_m_hat, "Manual cross-fitted IPW"),
        aipw_ate(y, d, manual_g0_hat, manual_g1_hat, manual_m_hat, "Manual cross-fitted AIPW"),
        aipw_att(y, d, manual_g0_hat, manual_m_hat, "Manual cross-fitted ATT-style AIPW"),
    ]
)
manual_estimates["true_target"] = np.where(manual_estimates["target"].eq("ATT-style"), TRUE_ATT, TRUE_ATE)
manual_estimates["bias_vs_target"] = manual_estimates["theta_hat"] - manual_estimates["true_target"]
save_table(manual_estimates, "manual_crossfit_estimates")
display(manual_estimates)
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value true_target bias_vs_target
0 Manual cross-fitted IPW ATE 1.039355 0.087898 0.867075 1.211634 NaN 0.996435 0.04292
1 Manual cross-fitted AIPW ATE 0.961524 0.052736 0.858163 1.064886 NaN 0.996435 -0.03491
2 Manual cross-fitted ATT-style AIPW ATT-style 1.116128 0.064518 0.989673 1.242583 NaN 1.132978 -0.01685

The AIPW estimate uses both outcome regression and propensity weighting. In this synthetic design, it is much closer to the target than the raw mean difference.

Manual Nuisance Quality

Because the data are synthetic, we can compare the cross-fitted nuisance predictions to the true nuisance functions.

In real data, the true nuisance functions are unknown. We would instead inspect out-of-fold errors, propensity calibration, overlap, and domain plausibility.
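For a truth-free check of the propensity model, a reliability table works: within bins of predicted propensity, the observed treatment rate should track the mean prediction. The sketch below uses scikit-learn's calibration_curve with illustrative bin settings.

from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(d, manual_m_hat, n_bins=10, strategy="quantile")
calibration_check = pd.DataFrame({"mean_predicted": prob_pred, "observed_treatment_rate": prob_true})
display(calibration_check)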

manual_nuisance_quality = pd.DataFrame(
    [
        {
            "nuisance": "g0(X) = E[Y(0) | X]",
            "rmse_vs_truth": rmse(true_g0, manual_g0_hat),
            "mae_vs_truth": float(mean_absolute_error(true_g0, manual_g0_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_g0, manual_g0_hat)[0, 1]),
        },
        {
            "nuisance": "g1(X) = E[Y(1) | X]",
            "rmse_vs_truth": rmse(true_g1, manual_g1_hat),
            "mae_vs_truth": float(mean_absolute_error(true_g1, manual_g1_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_g1, manual_g1_hat)[0, 1]),
        },
        {
            "nuisance": "m(X) = P(D=1 | X)",
            "rmse_vs_truth": rmse(true_p, manual_m_hat),
            "mae_vs_truth": float(mean_absolute_error(true_p, manual_m_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_p, manual_m_hat)[0, 1]),
            "auc_for_treatment": float(roc_auc_score(d, manual_m_hat)),
            "log_loss_for_treatment": float(log_loss(d, manual_m_hat)),
        },
    ]
)
save_table(manual_nuisance_quality, "manual_nuisance_quality")
display(manual_nuisance_quality)
nuisance rmse_vs_truth mae_vs_truth correlation_vs_truth auc_for_treatment log_loss_for_treatment
0 g0(X) = E[Y(0) | X] 0.437118 0.326555 0.917330 NaN NaN
1 g1(X) = E[Y(1) | X] 0.500857 0.370106 0.922514 NaN NaN
2 m(X) = P(D=1 | X) 0.127586 0.099716 0.846321 0.703504 0.635537

The nuisance models recover the broad structure but are not perfect. AIPW is designed to tolerate nuisance error better than a simple plug-in estimator, especially with cross-fitting.

Propensity Overlap Diagnostics

Overlap means treated and untreated units exist across the relevant covariate space. Propensity scores near zero or one create large inverse-propensity weights and unstable estimates.

This cell summarizes the manual cross-fitted propensity predictions and the implied weights.

weight_treated = d / manual_m_hat
weight_untreated = (1 - d) / (1 - manual_m_hat)
combined_ipw_weight = weight_treated + weight_untreated

propensity_diagnostics = pd.DataFrame(
    [
        {
            "diagnostic": "predicted propensity minimum",
            "value": float(np.min(manual_m_hat)),
        },
        {
            "diagnostic": "predicted propensity p05",
            "value": float(np.quantile(manual_m_hat, 0.05)),
        },
        {
            "diagnostic": "predicted propensity median",
            "value": float(np.median(manual_m_hat)),
        },
        {
            "diagnostic": "predicted propensity p95",
            "value": float(np.quantile(manual_m_hat, 0.95)),
        },
        {
            "diagnostic": "predicted propensity maximum",
            "value": float(np.max(manual_m_hat)),
        },
        {
            "diagnostic": "IPW effective sample size",
            "value": weighted_effective_sample_size(combined_ipw_weight),
        },
        {
            "diagnostic": "maximum IPW weight",
            "value": float(np.max(combined_ipw_weight)),
        },
    ]
)
save_table(propensity_diagnostics, "manual_propensity_diagnostics")
display(propensity_diagnostics)
diagnostic value
0 predicted propensity minimum 0.020000
1 predicted propensity p05 0.096206
2 predicted propensity median 0.440074
3 predicted propensity p95 0.863000
4 predicted propensity maximum 0.980000
5 IPW effective sample size 2299.346366
6 maximum IPW weight 33.762843

The predicted propensities stay away from zero and one. That is a healthy sign for this teaching dataset. In real data, this table is one of the first places to look for overlap trouble.
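When this table does flag trouble, one first-response sketch is a trimming check: drop rows with extreme predicted propensities and see how many observations and how much effective sample size survive. The 0.05 and 0.95 cutoffs below are illustrative, not a recommendation.

trim_mask = (manual_m_hat >= 0.05) & (manual_m_hat <= 0.95)
print(f"rows kept after trimming: {int(trim_mask.sum())} of {len(trim_mask)}")
print(f"effective sample size after trimming: {weighted_effective_sample_size(combined_ipw_weight[trim_mask]):.1f}")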

Propensity And Weight Plots

The left panel shows propensity overlap by treatment status. The right panel shows inverse-propensity weights on a log scale.

A small number of extreme weights can dominate an estimate, so weight diagnostics should be routine in IRM workflows.

propensity_plot_df = pd.DataFrame(
    {
        "predicted_propensity": manual_m_hat,
        "feature_exposure": d,
        "ipw_weight": combined_ipw_weight,
    }
)
propensity_plot_df["treatment_status"] = np.where(propensity_plot_df["feature_exposure"] == 1, "treated", "untreated")

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sns.histplot(
    data=propensity_plot_df,
    x="predicted_propensity",
    hue="treatment_status",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.35,
    ax=axes[0],
)
axes[0].set_title("Cross-Fitted Propensity Overlap")
axes[0].set_xlabel("Predicted P(D=1 | X)")

sns.histplot(
    data=propensity_plot_df,
    x="ipw_weight",
    hue="treatment_status",
    bins=60,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    ax=axes[1],
)
axes[1].set_xscale("log")
axes[1].set_title("Inverse-Propensity Weight Distribution")
axes[1].set_xlabel("IPW Weight, log scale")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_propensity_overlap_and_weights.png", dpi=160, bbox_inches="tight")
plt.show()

The overlap is not perfect, but both treatment groups have support over much of the propensity range. The weight plot does not show a severe tail problem.

Outcome Nuisance Diagnostics

The next figure compares the manually cross-fitted outcome nuisance predictions against the true outcome nuisance functions.

The treated outcome model is trained only on treated rows, and the untreated outcome model is trained only on untreated rows. Cross-fitting keeps the displayed predictions out of fold.

sample_idx = irm_df.sample(1_000, random_state=RANDOM_STATE).index

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
axes[0].scatter(true_g0[sample_idx], manual_g0_hat[sample_idx], alpha=0.25, s=18, color="#2563eb")
lower0 = min(np.quantile(true_g0, 0.01), np.quantile(manual_g0_hat, 0.01))
upper0 = max(np.quantile(true_g0, 0.99), np.quantile(manual_g0_hat, 0.99))
axes[0].plot([lower0, upper0], [lower0, upper0], color="#b91c1c", linestyle="--", linewidth=1.4)
axes[0].set_title("Untreated Outcome Nuisance")
axes[0].set_xlabel("True g0(X)")
axes[0].set_ylabel("Predicted g0_hat(X)")

axes[1].scatter(true_g1[sample_idx], manual_g1_hat[sample_idx], alpha=0.25, s=18, color="#047857")
lower1 = min(np.quantile(true_g1, 0.01), np.quantile(manual_g1_hat, 0.01))
upper1 = max(np.quantile(true_g1, 0.99), np.quantile(manual_g1_hat, 0.99))
axes[1].plot([lower1, upper1], [lower1, upper1], color="#b91c1c", linestyle="--", linewidth=1.4)
axes[1].set_title("Treated Outcome Nuisance")
axes[1].set_xlabel("True g1(X)")
axes[1].set_ylabel("Predicted g1_hat(X)")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_manual_outcome_nuisance_quality.png", dpi=160, bbox_inches="tight")
plt.show()

The outcome models capture most of the pattern, with some shrinkage at the tails. That is typical for tree-based learners and finite data.

Build The DoubleML Data Object

The DoubleMLData object stores the role assignment:

  • weekly_value is the outcome.
  • feature_exposure is the binary treatment.
  • the feature columns are pre-treatment controls.

Hidden teaching columns are excluded.

dml_data = DoubleMLData(
    irm_df[["weekly_value", "feature_exposure"] + feature_cols],
    y_col="weekly_value",
    d_cols="feature_exposure",
    x_cols=feature_cols,
)

backend_summary = pd.DataFrame(
    [
        {"role": "outcome", "columns": dml_data.y_col},
        {"role": "treatment", "columns": ", ".join(dml_data.d_cols)},
        {"role": "controls", "columns": ", ".join(dml_data.x_cols)},
        {"role": "number_of_observations", "columns": str(dml_data.n_obs)},
    ]
)
save_table(backend_summary, "doublemldata_backend_summary")
display(backend_summary)
role columns
0 outcome weekly_value
1 treatment feature_exposure
2 controls engagement_score, need_intensity, content_fit,...
3 number_of_observations 4000

This backend is intentionally simple: one outcome, one binary treatment, and observed controls. The real work is deciding whether those controls support an observed-control causal design.

Fit DoubleMLIRM For ATE And ATTE

We now fit four DoubleML models:

  • linear/logistic nuisance ATE;
  • linear/logistic nuisance ATTE;
  • gradient-boosted nuisance ATE;
  • gradient-boosted nuisance ATTE.

The score option changes the target. score="ATE" targets the population average effect. score="ATTE" targets the effect among treated units.

irm_model_specs = {
    "Linear nuisance IRM ATE": {
        "ml_g": linear_outcome_learner,
        "ml_m": linear_propensity_learner,
        "score": "ATE",
        "true_target": TRUE_ATE,
    },
    "Linear nuisance IRM ATTE": {
        "ml_g": linear_outcome_learner,
        "ml_m": linear_propensity_learner,
        "score": "ATTE",
        "true_target": TRUE_ATT,
    },
    "Gradient boosting nuisance IRM ATE": {
        "ml_g": hgb_outcome_learner,
        "ml_m": hgb_propensity_learner,
        "score": "ATE",
        "true_target": TRUE_ATE,
    },
    "Gradient boosting nuisance IRM ATTE": {
        "ml_g": hgb_outcome_learner,
        "ml_m": hgb_propensity_learner,
        "score": "ATTE",
        "true_target": TRUE_ATT,
    },
}

irm_models = {}
for model_name, spec in irm_model_specs.items():
    model = DoubleMLIRM(
        dml_data,
        ml_g=clone(spec["ml_g"]),
        ml_m=clone(spec["ml_m"]),
        n_folds=5,
        n_rep=1,
        score=spec["score"],
    )
    model.fit(store_predictions=True)
    irm_models[model_name] = model
    print(f"Finished: {model_name}")

irm_summary_rows = []
for model_name, model in irm_models.items():
    summary = model.summary.reset_index().rename(columns={"index": "treatment"})
    row = summary.iloc[0].to_dict()
    true_target = irm_model_specs[model_name]["true_target"]
    irm_summary_rows.append(
        {
            "estimator": model_name,
            "score": irm_model_specs[model_name]["score"],
            "treatment": row["treatment"],
            "theta_hat": row["coef"],
            "std_error": row["std err"],
            "t_stat": row["t"],
            "p_value": row["P>|t|"],
            "ci_95_lower": row["2.5 %"],
            "ci_95_upper": row["97.5 %"],
            "true_target": true_target,
            "bias_vs_target": row["coef"] - true_target,
        }
    )

irm_summary = pd.DataFrame(irm_summary_rows)
save_table(irm_summary, "doubleml_irm_summary")
display(irm_summary)
Finished: Linear nuisance IRM ATE
Finished: Linear nuisance IRM ATTE
Finished: Gradient boosting nuisance IRM ATE
Finished: Gradient boosting nuisance IRM ATTE
estimator score treatment theta_hat std_error t_stat p_value ci_95_lower ci_95_upper true_target bias_vs_target
0 Linear nuisance IRM ATE ATE feature_exposure 1.027968 0.056428 18.217240 3.766794e-74 0.917370 1.138565 0.996435 0.031533
1 Linear nuisance IRM ATTE ATTE feature_exposure 1.162320 0.057914 20.069872 1.353534e-89 1.048812 1.275829 1.132978 0.029342
2 Gradient boosting nuisance IRM ATE ATE feature_exposure 0.985272 0.053888 18.283853 1.112793e-74 0.879654 1.090890 0.996435 -0.011163
3 Gradient boosting nuisance IRM ATTE ATTE feature_exposure 1.048266 0.069878 15.001366 7.192421e-51 0.911308 1.185224 1.132978 -0.084712

The gradient-boosted nuisance specification is expected to do well because the synthetic treatment assignment and outcome functions are nonlinear. The linear/logistic specification remains useful as a transparent baseline.

Compare All Estimators

This table combines simple baselines, manual cross-fitted estimates, and DoubleML estimates.

When reading the table, compare each row to the right target. ATE rows should be compared with the true ATE. ATT-style and ATTE rows should be compared with the true treated target.

comparison_rows = []
comparison_rows.append(baseline_estimates)
comparison_rows.append(manual_estimates)
comparison_rows.append(
    irm_summary.rename(columns={"score": "target"})[
        ["estimator", "target", "theta_hat", "std_error", "ci_95_lower", "ci_95_upper", "p_value", "true_target", "bias_vs_target"]
    ]
)

all_estimates = pd.concat(comparison_rows, ignore_index=True, sort=False)
all_estimates["target"] = all_estimates["target"].replace({"ATTE": "ATT-style"})
all_estimates["abs_bias_vs_target"] = all_estimates["bias_vs_target"].abs()
save_table(all_estimates, "all_estimator_comparison")
display(all_estimates.sort_values("abs_bias_vs_target"))
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value true_target bias_vs_target abs_bias_vs_target
2 Oracle regression adjustment ATE 0.996435 0.005999 0.984677 1.008192 NaN 0.996435 0.000000 0.000000
10 Gradient boosting nuisance IRM ATE ATE 0.985272 0.053888 0.879654 1.090890 1.112793e-74 0.996435 -0.011163 0.011163
4 Oracle ATT-style AIPW ATT-style 1.119326 0.045365 1.030411 1.208241 NaN 1.132978 -0.013652 0.013652
7 Manual cross-fitted ATT-style AIPW ATT-style 1.116128 0.064518 0.989673 1.242583 NaN 1.132978 -0.016850 0.016850
9 Linear nuisance IRM ATTE ATT-style 1.162320 0.057914 1.048812 1.275829 1.353534e-89 1.132978 0.029342 0.029342
8 Linear nuisance IRM ATE ATE 1.027968 0.056428 0.917370 1.138565 3.766794e-74 0.996435 0.031533 0.031533
6 Manual cross-fitted AIPW ATE 0.961524 0.052736 0.858163 1.064886 NaN 0.996435 -0.034910 0.034910
5 Manual cross-fitted IPW ATE 1.039355 0.087898 0.867075 1.211634 NaN 0.996435 0.042920 0.042920
1 OLS with raw controls raw regression slope 1.045854 0.046136 0.955429 1.136279 9.075105e-114 0.996435 0.049419 0.049419
3 Oracle AIPW ATE 0.945411 0.037036 0.872821 1.018001 NaN 0.996435 -0.051024 0.051024
11 Gradient boosting nuisance IRM ATTE ATT-style 1.048266 0.069878 0.911308 1.185224 7.192421e-51 1.132978 -0.084712 0.084712
0 Difference in means unadjusted mean difference 1.595003 0.048559 1.499827 1.690179 NaN 0.996435 0.598568 0.598568

The comparison shows the point of IRM: use outcome and propensity nuisance models to adjust for the observed differences between treated and untreated units.

Estimate Comparison Plot

The next figure compares estimates with confidence intervals. The vertical reference lines mark the true ATE and true ATT-style target from the synthetic data.

plot_estimates = all_estimates.copy().sort_values("theta_hat")
plot_estimates["lower_error"] = plot_estimates["theta_hat"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["theta_hat"]

fig, ax = plt.subplots(figsize=(11, 7.2))
y_pos = np.arange(len(plot_estimates))
ax.errorbar(
    x=plot_estimates["theta_hat"],
    y=y_pos,
    xerr=[plot_estimates["lower_error"], plot_estimates["upper_error"]],
    fmt="o",
    color="#0f172a",
    ecolor="#475569",
    capsize=4,
)
ax.axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE")
ax.axvline(TRUE_ATT, color="#b91c1c", linestyle="--", linewidth=1.5, label="True ATT-style")
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_estimates["estimator"])
ax.set_xlabel("Estimated Effect")
ax.set_title("IRM Estimate Comparison")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_irm_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The plot shows both bias reduction and estimand differences. ATE and ATT-style estimates should not be forced to agree when treatment effects are heterogeneous.

DoubleML Nuisance Losses

evaluate_learners() reports out-of-fold nuisance losses. For IRM, the keys are:

  • ml_g0: untreated outcome nuisance;
  • ml_g1: treated outcome nuisance;
  • ml_m: propensity score nuisance.

The loss values are diagnostics, not causal proof. A good propensity model cannot fix unobserved confounding.

loss_tables = []
for model_name, model in irm_models.items():
    loss_tables.append(learner_loss_table(model, model_name))

nuisance_losses = pd.concat(loss_tables, ignore_index=True)
save_table(nuisance_losses, "doubleml_nuisance_losses")
display(nuisance_losses)
model learner mean_loss min_loss max_loss
0 Linear nuisance IRM ATE ml_g0 1.314451 1.314451 1.314451
1 Linear nuisance IRM ATE ml_g1 1.310666 1.310666 1.310666
2 Linear nuisance IRM ATE ml_m 0.456163 0.456163 0.456163
3 Linear nuisance IRM ATTE ml_g0 1.312939 1.312939 1.312939
4 Linear nuisance IRM ATTE ml_g1 1.309140 1.309140 1.309140
5 Linear nuisance IRM ATTE ml_m 0.455825 0.455825 0.455825
6 Gradient boosting nuisance IRM ATE ml_g0 1.086616 1.086616 1.086616
7 Gradient boosting nuisance IRM ATE ml_g1 1.063938 1.063938 1.063938
8 Gradient boosting nuisance IRM ATE ml_m 0.470521 0.470521 0.470521
9 Gradient boosting nuisance IRM ATTE ml_g0 1.086821 1.086821 1.086821
10 Gradient boosting nuisance IRM ATTE ml_g1 1.070723 1.070723 1.070723
11 Gradient boosting nuisance IRM ATTE ml_m 0.469567 0.469567 0.469567

The loss table helps compare nuisance specifications. Lower nuisance loss is useful, but the final causal estimate also depends on overlap and identification assumptions.
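By default the reported loss is a root-mean-squared error between out-of-fold predictions and targets. Recent DoubleML versions also accept a custom metric; the sketch below, assuming the learners and metric arguments of evaluate_learners, scores the propensity learner on log loss instead.

def binary_log_loss(y_true, y_pred):
    # evaluate_learners passes arrays of targets and predictions; flatten both.
    return float(log_loss(np.asarray(y_true).ravel(), np.asarray(y_pred).ravel()))

propensity_loss = irm_models["Gradient boosting nuisance IRM ATE"].evaluate_learners(
    learners=["ml_m"], metric=binary_log_loss
)
print(propensity_loss)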

Extract DoubleML Predictions

We now pull the nuisance predictions from the gradient-boosted ATE model. These predictions are useful for diagnostics and for connecting DoubleML output back to the manual AIPW formula.

irm_hgb_ate = irm_models["Gradient boosting nuisance IRM ATE"]
dml_g0_hat = prediction_vector(irm_hgb_ate, "ml_g0")
dml_g1_hat = prediction_vector(irm_hgb_ate, "ml_g1")
dml_m_hat = np.clip(prediction_vector(irm_hgb_ate, "ml_m"), 0.02, 0.98)

dml_prediction_quality = pd.DataFrame(
    [
        {
            "nuisance": "ml_g0 untreated outcome",
            "rmse_vs_truth": rmse(true_g0, dml_g0_hat),
            "mae_vs_truth": float(mean_absolute_error(true_g0, dml_g0_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_g0, dml_g0_hat)[0, 1]),
        },
        {
            "nuisance": "ml_g1 treated outcome",
            "rmse_vs_truth": rmse(true_g1, dml_g1_hat),
            "mae_vs_truth": float(mean_absolute_error(true_g1, dml_g1_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_g1, dml_g1_hat)[0, 1]),
        },
        {
            "nuisance": "ml_m propensity",
            "rmse_vs_truth": rmse(true_p, dml_m_hat),
            "mae_vs_truth": float(mean_absolute_error(true_p, dml_m_hat)),
            "correlation_vs_truth": float(np.corrcoef(true_p, dml_m_hat)[0, 1]),
            "auc_for_treatment": float(roc_auc_score(d, dml_m_hat)),
            "log_loss_for_treatment": float(log_loss(d, dml_m_hat)),
        },
    ]
)
save_table(dml_prediction_quality, "hgb_nuisance_quality_against_truth")
display(dml_prediction_quality)
nuisance rmse_vs_truth mae_vs_truth correlation_vs_truth auc_for_treatment log_loss_for_treatment
0 ml_g0 untreated outcome 0.445073 0.329864 0.914579 NaN NaN
1 ml_g1 treated outcome 0.496318 0.366543 0.922845 NaN NaN
2 ml_m propensity 0.125739 0.098033 0.849808 0.702208 0.637294

The DoubleML nuisance predictions have similar quality to the manual cross-fitted nuisance predictions. That is expected because they use the same learner family and the same core cross-fitting idea.
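The loop can be closed explicitly: plugging the extracted DoubleML nuisance predictions into the manual aipw_ate helper should land very close to the DoubleML coefficient. Small differences from clipping and trimming conventions are expected.

manual_from_dml = aipw_ate(y, d, dml_g0_hat, dml_g1_hat, dml_m_hat, "AIPW from DoubleML predictions")
print(f"manual AIPW with DoubleML nuisances: {manual_from_dml['theta_hat']:.4f}")
print(f"DoubleMLIRM coefficient:             {float(irm_hgb_ate.coef[0]):.4f}")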

DoubleML Propensity Diagnostics

This cell summarizes the propensity predictions from the main DoubleML ATE model and the corresponding weights.

dml_ipw_weight = d / dml_m_hat + (1 - d) / (1 - dml_m_hat)

dml_propensity_diagnostics = pd.DataFrame(
    [
        {"diagnostic": "predicted propensity minimum", "value": float(np.min(dml_m_hat))},
        {"diagnostic": "predicted propensity p05", "value": float(np.quantile(dml_m_hat, 0.05))},
        {"diagnostic": "predicted propensity median", "value": float(np.median(dml_m_hat))},
        {"diagnostic": "predicted propensity p95", "value": float(np.quantile(dml_m_hat, 0.95))},
        {"diagnostic": "predicted propensity maximum", "value": float(np.max(dml_m_hat))},
        {"diagnostic": "IPW effective sample size", "value": weighted_effective_sample_size(dml_ipw_weight)},
        {"diagnostic": "maximum IPW weight", "value": float(np.max(dml_ipw_weight))},
    ]
)
save_table(dml_propensity_diagnostics, "doubleml_propensity_diagnostics")
display(dml_propensity_diagnostics)
diagnostic value
0 predicted propensity minimum 0.020000
1 predicted propensity p05 0.100421
2 predicted propensity median 0.440048
3 predicted propensity p95 0.862390
4 predicted propensity maximum 0.980000
5 IPW effective sample size 2225.352824
6 maximum IPW weight 41.088665

The propensity distribution is well behaved here. In real data, high maximum weights or a low effective sample size would lead to trimming, redesign, or a more limited target population.
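If trimming were needed, DoubleMLIRM exposes it directly. The sketch below assumes the trimming_rule and trimming_threshold constructor arguments; a threshold of 0.05 truncates predicted propensities into [0.05, 0.95] before they enter the score.

trimmed_irm = DoubleMLIRM(
    dml_data,
    ml_g=clone(hgb_outcome_learner),
    ml_m=clone(hgb_propensity_learner),
    n_folds=5,
    score="ATE",
    trimming_rule="truncate",
    trimming_threshold=0.05,
)
trimmed_irm.fit()
display(trimmed_irm.summary)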

DoubleML Nuisance Plots

This figure checks the gradient-boosted DoubleML nuisance predictions against true nuisance functions from the simulation.

The third panel compares predicted and true propensities. This is a useful calibration-style check in synthetic data.

sample_idx = irm_df.sample(1_000, random_state=RANDOM_STATE).index

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
plot_specs = [
    (true_g0, dml_g0_hat, "Untreated Outcome Nuisance", "True g0(X)", "Predicted g0_hat(X)", "#2563eb"),
    (true_g1, dml_g1_hat, "Treated Outcome Nuisance", "True g1(X)", "Predicted g1_hat(X)", "#047857"),
    (true_p, dml_m_hat, "Propensity Nuisance", "True P(D=1 | X)", "Predicted m_hat(X)", "#b45309"),
]

for ax, (truth, pred, title, xlabel, ylabel, color) in zip(axes, plot_specs):
    ax.scatter(truth[sample_idx], pred[sample_idx], alpha=0.25, s=18, color=color)
    lower = min(np.quantile(truth, 0.01), np.quantile(pred, 0.01))
    upper = max(np.quantile(truth, 0.99), np.quantile(pred, 0.99))
    ax.plot([lower, upper], [lower, upper], color="#b91c1c", linestyle="--", linewidth=1.4)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_doubleml_nuisance_prediction_quality.png", dpi=160, bbox_inches="tight")
plt.show()

The nuisance plots show useful but imperfect recovery. That is normal: orthogonal scores are built for realistic nuisance error, not perfect prediction.
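
In real data the true propensity is unknown, so the third panel cannot be drawn. A common substitute is a reliability curve against the observed treatment indicator. A minimal sketch using scikit-learn's calibration_curve, reusing d and dml_m_hat from above:

from sklearn.calibration import calibration_curve

# Bin the predicted propensities and compare the observed treatment rate in each
# bin to the mean predicted propensity; a calibrated model tracks the diagonal.
observed_rate, predicted_mean = calibration_curve(d, dml_m_hat, n_bins=10, strategy="quantile")

fig, ax = plt.subplots(figsize=(5.5, 5))
ax.plot(predicted_mean, observed_rate, marker="o", color="#b45309")
ax.plot([0, 1], [0, 1], color="#b91c1c", linestyle="--", linewidth=1.4)
ax.set_title("Propensity Reliability Curve")
ax.set_xlabel("Mean Predicted Propensity")
ax.set_ylabel("Observed Treatment Rate")
plt.tight_layout()
plt.show()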

Score Contributions

DoubleML stores orthogonal score contributions in psi. At the fitted estimate, the average score should be close to zero.

Large tails in the score distribution can point to extreme weights, poor overlap, or influential observations.

# psi has shape (n_obs, n_rep, n_coefs); take the single repetition and treatment
psi_values = np.asarray(irm_hgb_ate.psi)[:, 0, 0]
psi_summary = pd.DataFrame(
    [
        {
            "mean": float(np.mean(psi_values)),
            "std": float(np.std(psi_values, ddof=1)),
            "p01": float(np.quantile(psi_values, 0.01)),
            "p05": float(np.quantile(psi_values, 0.05)),
            "median": float(np.median(psi_values)),
            "p95": float(np.quantile(psi_values, 0.95)),
            "p99": float(np.quantile(psi_values, 0.99)),
        }
    ]
)
save_table(psi_summary, "orthogonal_score_summary")
display(psi_summary)

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(psi_values, bins=80, color="#2563eb", alpha=0.75, ax=ax)
ax.axvline(0, color="#111827", linewidth=1, linestyle="--")
ax.set_title("IRM Orthogonal Score Contributions")
ax.set_xlabel("psi_i")
ax.set_ylabel("Count")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_orthogonal_score_distribution.png", dpi=160, bbox_inches="tight")
plt.show()
mean std p01 p05 median p95 p99
0 -3.552714e-17 3.408574 -9.700456 -4.214624 -0.035836 4.012218 9.763326

The score is centered near zero, as expected. The tails are not extreme in this well-overlapped synthetic dataset.
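
A related sanity check ties the score spread to the reported standard error. For the IRM ATE score the derivative component psi_a is the constant -1, so the standard error should be close to std(psi) / sqrt(n). A short check under that assumption:

# With psi_a = -1, the asymptotic variance of theta_hat is Var(psi) / n,
# so the score table's std column should reproduce the reported standard error.
n_obs_psi = psi_values.shape[0]
se_from_psi = float(np.std(psi_values, ddof=1) / np.sqrt(n_obs_psi))
print(f"SE implied by score spread: {se_from_psi:.4f}")
print(f"SE reported by DoubleML:    {float(irm_hgb_ate.se[0]):.4f}")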

Bootstrap Confidence Intervals

DoubleML can compute bootstrap confidence intervals. We run a moderately sized bootstrap of 500 repetitions for the main ATE and ATTE models.

Bootstrap uncertainty still assumes the design target is valid. It does not diagnose hidden confounding.

bootstrap_rows = []
for model_name in ["Gradient boosting nuisance IRM ATE", "Gradient boosting nuisance IRM ATTE"]:
    model = irm_models[model_name]
    model.bootstrap(method="normal", n_rep_boot=500)
    ci = model.confint(joint=False, level=0.95).reset_index().rename(columns={"index": "treatment"})
    ci["estimator"] = f"{model_name} bootstrap"
    ci["score"] = irm_model_specs[model_name]["score"]
    bootstrap_rows.append(ci)

bootstrap_ci = pd.concat(bootstrap_rows, ignore_index=True)
save_table(bootstrap_ci, "hgb_bootstrap_confidence_intervals")
display(bootstrap_ci)
treatment 2.5 % 97.5 % estimator score
0 feature_exposure 0.879654 1.090890 Gradient boosting nuisance IRM ATE bootstrap ATE
1 feature_exposure 0.911308 1.185224 Gradient boosting nuisance IRM ATTE bootstrap ATTE

The bootstrap intervals are another uncertainty summary around the same observed-control design. They should be reported alongside overlap and nuisance diagnostics.
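
The DoubleML bootstrap is a multiplier bootstrap on the stored score rather than a row-resampling bootstrap: each repetition reweights the psi contributions with random multipliers and never refits a nuisance model. A simplified sketch of the normal-multiplier idea for the ATE score, again assuming psi_a = -1 so each draw perturbs the estimate by a weighted mean of psi; the library itself bootstraps studentized statistics, so this is an illustration, not a reproduction:

# Each draw perturbs theta_hat by a multiplier-weighted mean of the score.
boot_rng = np.random.default_rng(RANDOM_STATE)
multipliers = boot_rng.standard_normal((500, psi_values.shape[0]))
draws = irm_hgb_ate.coef[0] + multipliers @ psi_values / psi_values.shape[0]
ci_lower, ci_upper = np.quantile(draws, [0.025, 0.975])
print(f"Multiplier-bootstrap 95% CI sketch: [{ci_lower:.4f}, {ci_upper:.4f}]")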

Repeated Sample Splitting

Cross-fitted estimates can vary slightly with the fold split. Repeated sample splitting checks whether the result is numerically stable across several random splits.

This cell uses a lighter gradient-boosting configuration to keep the check fast.

repeated_outcome_learner = HistGradientBoostingRegressor(
    max_iter=150,
    learning_rate=0.05,
    max_leaf_nodes=22,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)
repeated_propensity_learner = HistGradientBoostingClassifier(
    max_iter=150,
    learning_rate=0.05,
    max_leaf_nodes=22,
    l2_regularization=0.001,
    random_state=RANDOM_STATE,
)

irm_repeated = DoubleMLIRM(
    dml_data,
    ml_g=clone(repeated_outcome_learner),
    ml_m=clone(repeated_propensity_learner),
    n_folds=5,
    n_rep=5,
    score="ATE",
)
irm_repeated.fit(store_predictions=False)

# all_coef and all_se have shape (n_coefs, n_rep); collect one row per split repetition
split_stability = pd.DataFrame(
    {
        "repetition": np.arange(1, irm_repeated.all_coef.shape[1] + 1),
        "theta_hat": irm_repeated.all_coef[0],
        "std_error": irm_repeated.all_se[0],
    }
)
split_stability["true_ate"] = TRUE_ATE
split_stability["bias_vs_true_ate"] = split_stability["theta_hat"] - TRUE_ATE
save_table(split_stability, "repeated_sample_split_stability")
display(split_stability)
repetition theta_hat std_error true_ate bias_vs_true_ate
0 1 0.947686 0.047389 0.996435 -0.048749
1 2 0.956732 0.049821 0.996435 -0.039703
2 3 0.981438 0.048293 0.996435 -0.014996
3 4 0.979799 0.048529 0.996435 -0.016635
4 5 0.943649 0.047064 0.996435 -0.052786

The repeated estimates should be close if the nuisance models and overlap are stable. Large movement would suggest a fragile workflow.
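
With n_rep greater than one, DoubleML reports a single aggregated estimate. The documented default takes the median of the per-repetition estimates and a median-based variance that adds the squared deviation from that median; the sketch below reproduces that rule from the table above, treating the exact formula as an assumption about the library default:

# Median aggregation across splits: the point estimate is the median theta_hat,
# and the variance adds the between-split spread to each split's variance.
theta_med = float(np.median(split_stability["theta_hat"]))
se_agg = float(np.sqrt(np.median(
    split_stability["std_error"] ** 2 + (split_stability["theta_hat"] - theta_med) ** 2
)))
print(f"Aggregated estimate: {theta_med:.4f} (se {se_agg:.4f})")
print(f"DoubleML reports:    {float(irm_repeated.coef[0]):.4f} (se {float(irm_repeated.se[0]):.4f})")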

Split Stability Plot

The next plot shows the repeated-split ATE estimates against the true ATE reference line.

fig, ax = plt.subplots(figsize=(8.5, 4.8))
sns.lineplot(data=split_stability, x="repetition", y="theta_hat", marker="o", color="#2563eb", ax=ax)
ax.axhline(TRUE_ATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True ATE")
ax.set_title("IRM Repeated Sample-Splitting Stability")
ax.set_xlabel("Repetition")
ax.set_ylabel("Estimated ATE")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_repeated_split_stability.png", dpi=160, bbox_inches="tight")
plt.show()

The estimates are stable across split repetitions in this synthetic setting. That supports numerical reliability, not causal identification by itself.

Overlap Stress Test

Now we isolate the overlap problem. We keep the same potential outcomes but make the treatment assignment rule more or less extreme.

This stress test uses the true nuisance functions so the only thing changing is propensity overlap. As propensities move toward zero or one, weights become more variable and uncertainty increases.

overlap_rows = []
base_logit = true_logit.to_numpy()

for scale in [0.50, 0.85, 1.25, 1.75, 2.50]:
    # Larger logit scales push propensities toward 0 and 1, weakening overlap
    stress_propensity = np.clip(sigmoid(scale * base_logit), 0.01, 0.99)
    stress_d = rng.binomial(1, stress_propensity, size=N_OBS)
    stress_y = np.where(stress_d == 1, potential_outcome_1, potential_outcome_0)
    stress_aipw = aipw_ate(stress_y, stress_d, true_g0, true_g1, stress_propensity, f"scale={scale}")
    stress_weights = stress_d / stress_propensity + (1 - stress_d) / (1 - stress_propensity)
    overlap_rows.append(
        {
            "logit_scale": scale,
            "treatment_rate": float(stress_d.mean()),
            "propensity_p01": float(np.quantile(stress_propensity, 0.01)),
            "propensity_p99": float(np.quantile(stress_propensity, 0.99)),
            "max_weight": float(np.max(stress_weights)),
            "effective_sample_size": weighted_effective_sample_size(stress_weights),
            "theta_hat_oracle_aipw": stress_aipw["theta_hat"],
            "std_error_oracle_aipw": stress_aipw["std_error"],
            "ci_95_lower": stress_aipw["ci_95_lower"],
            "ci_95_upper": stress_aipw["ci_95_upper"],
        }
    )

overlap_stress = pd.DataFrame(overlap_rows)
overlap_stress["true_ate"] = TRUE_ATE
overlap_stress["bias_vs_true_ate"] = overlap_stress["theta_hat_oracle_aipw"] - TRUE_ATE
save_table(overlap_stress, "overlap_stress_test")
display(overlap_stress)
logit_scale treatment_rate propensity_p01 propensity_p99 max_weight effective_sample_size theta_hat_oracle_aipw std_error_oracle_aipw ci_95_lower ci_95_upper true_ate bias_vs_true_ate
0 0.50 0.48125 0.218448 0.750508 6.445538 3741.777106 0.975751 0.032919 0.911229 1.040272 0.996435 -0.020684
1 0.85 0.47300 0.102749 0.866718 13.988948 3236.984035 1.020001 0.035749 0.949934 1.090068 0.996435 0.023566
2 1.25 0.46175 0.039664 0.940100 40.438987 2328.986111 0.923440 0.039034 0.846933 0.999947 0.996435 -0.072995
3 1.75 0.45175 0.011412 0.979258 65.500811 1727.041898 1.050359 0.045686 0.960815 1.139902 0.996435 0.053924
4 2.50 0.44900 0.010000 0.990000 100.000000 693.711527 1.004237 0.054236 0.897934 1.110540 0.996435 0.007802

As the treatment rule becomes more extreme, the effective sample size falls and the maximum weight rises. At the most extreme setting the maximum weight reaches 100, exactly the 1/0.01 cap implied by the clipping floor, so the unclipped weights would be larger still. This is the overlap problem in one table.

Overlap Stress Plot

The left panel shows uncertainty in the oracle AIPW estimate as overlap worsens. The right panel shows the effective sample size implied by inverse-propensity weights.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

axes[0].errorbar(
    x=overlap_stress["logit_scale"],
    y=overlap_stress["theta_hat_oracle_aipw"],
    yerr=[
        overlap_stress["theta_hat_oracle_aipw"] - overlap_stress["ci_95_lower"],
        overlap_stress["ci_95_upper"] - overlap_stress["theta_hat_oracle_aipw"],
    ],
    fmt="o-",
    color="#2563eb",
    ecolor="#475569",
    capsize=4,
)
axes[0].axhline(TRUE_ATE, color="#b91c1c", linestyle="--", linewidth=1.5)
axes[0].set_title("Oracle AIPW Under Overlap Stress")
axes[0].set_xlabel("Treatment Logit Scale")
axes[0].set_ylabel("Estimated ATE")

sns.lineplot(
    data=overlap_stress,
    x="logit_scale",
    y="effective_sample_size",
    marker="o",
    color="#047857",
    ax=axes[1],
)
axes[1].set_title("Weight Effective Sample Size")
axes[1].set_xlabel("Treatment Logit Scale")
axes[1].set_ylabel("Effective Sample Size")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_overlap_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()

Poor overlap is not solved by a fancier model. It changes the support of the causal question. Sometimes the honest fix is to narrow the target population.

Propensity Clipping Sensitivity

In practice, analysts often clip or trim extreme propensity scores. Clipping can reduce variance from extreme weights, but it also changes the exact estimating equation.

This cell recomputes manual AIPW estimates under several clipping thresholds.

clipping_rows = []
for clip in [0.01, 0.02, 0.05, 0.10]:
    # Re-clip the cross-fitted propensities at each threshold before re-estimating
    clipped_m = np.clip(manual_m_hat, clip, 1 - clip)
    ate_row = aipw_ate(y, d, manual_g0_hat, manual_g1_hat, clipped_m, f"AIPW clip={clip:.2f}")
    att_row = aipw_att(y, d, manual_g0_hat, clipped_m, f"ATT-style AIPW clip={clip:.2f}")
    for row in [ate_row, att_row]:
        row["clip_threshold"] = clip
        row["true_target"] = TRUE_ATT if row["target"] == "ATT-style" else TRUE_ATE
        row["bias_vs_target"] = row["theta_hat"] - row["true_target"]
        clipping_rows.append(row)

clipping_sensitivity = pd.DataFrame(clipping_rows)
save_table(clipping_sensitivity, "propensity_clipping_sensitivity")
display(clipping_sensitivity)
estimator target theta_hat std_error ci_95_lower ci_95_upper p_value clip_threshold true_target bias_vs_target
0 AIPW clip=0.01 ATE 0.961524 0.052736 0.858163 1.064886 NaN 0.01 0.996435 -0.034910
1 ATT-style AIPW clip=0.01 ATT-style 1.116128 0.064518 0.989673 1.242583 NaN 0.01 1.132978 -0.016850
2 AIPW clip=0.02 ATE 0.961524 0.052736 0.858163 1.064886 NaN 0.02 0.996435 -0.034910
3 ATT-style AIPW clip=0.02 ATT-style 1.116128 0.064518 0.989673 1.242583 NaN 0.02 1.132978 -0.016850
4 AIPW clip=0.05 ATE 0.965683 0.051320 0.865096 1.066270 NaN 0.05 0.996435 -0.030752
5 ATT-style AIPW clip=0.05 ATT-style 1.111452 0.063686 0.986628 1.236275 NaN 0.05 1.132978 -0.021527
6 AIPW clip=0.10 ATE 0.964527 0.047148 0.872116 1.056938 NaN 0.10 0.996435 -0.031908
7 ATT-style AIPW clip=0.10 ATT-style 1.105364 0.058841 0.990035 1.220693 NaN 0.10 1.132978 -0.027614

The clipping table is stable here because the propensity distribution is already well behaved. In weaker-overlap data, clipping sensitivity can be much larger.
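
Trimming is the discard-based alternative to clipping: units with extreme propensities are dropped entirely, which tames the weights but redefines the estimand over the retained, overlapping subpopulation. A minimal sketch reusing the manual cross-fitted nuisances and the aipw_ate helper from earlier in the notebook, assuming these are aligned NumPy arrays:

# Drop units outside [trim, 1 - trim] instead of clipping their propensities;
# the target becomes the ATE on the retained subpopulation, not the full ATE.
trim = 0.05
keep = (manual_m_hat >= trim) & (manual_m_hat <= 1 - trim)
print(f"Retained {int(keep.sum())} of {keep.size} units")
trimmed_ate = aipw_ate(
    y[keep], d[keep], manual_g0_hat[keep], manual_g1_hat[keep],
    manual_m_hat[keep], f"AIPW trimmed at {trim:.2f}",
)
print(f"Trimmed ATE: {trimmed_ate['theta_hat']:.4f} (se {trimmed_ate['std_error']:.4f})")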

When IRM Is The Right Or Wrong Tool

IRM is a good fit when:

  • the treatment is binary;
  • observed covariates plausibly adjust for treatment assignment;
  • there is overlap between treated and untreated units;
  • the target is ATE or ATT-style.

IRM is not enough when:

  • there is important unobserved confounding;
  • the treatment is continuous;
  • the treatment is assigned by an instrument or cutoff design;
  • overlap is so weak that treated and untreated units are not comparable;
  • interference between units is central to the question.

Reporting Checklist

This checklist turns the notebook into a reusable binary-treatment workflow.

A credible IRM report should include the target estimand, the treatment definition, control timing, propensity overlap, nuisance diagnostics, uncertainty, and limitations.

reporting_checklist = pd.DataFrame(
    [
        {"item": "Causal question", "status": "Estimate effect of feature_exposure on weekly_value."},
        {"item": "Treatment type", "status": "Binary treatment, so IRM is appropriate for ATE and ATTE-style targets."},
        {"item": "Estimands", "status": "Both ATE and ATT-style targets are shown because treatment effects are heterogeneous."},
        {"item": "Control timing", "status": "All controls are pre-treatment by construction in this synthetic dataset."},
        {"item": "Unconfoundedness", "status": "Holds by construction here; requires design evidence in real data."},
        {"item": "Overlap", "status": "Propensity and weight diagnostics reported."},
        {"item": "Nuisance learners", "status": "Compared linear/logistic and gradient-boosted nuisance specifications."},
        {"item": "Cross-fitting", "status": "Manual AIPW and DoubleMLIRM use cross-fitted nuisance predictions."},
        {"item": "Uncertainty", "status": "Standard errors, confidence intervals, and bootstrap intervals reported."},
        {"item": "Stability", "status": "Repeated sample splitting and clipping sensitivity included."},
        {"item": "Main limitation", "status": "IRM does not solve hidden confounding or support violations."},
    ]
)
save_table(reporting_checklist, "irm_reporting_checklist")
display(reporting_checklist)
item status
0 Causal question Estimate effect of feature_exposure on weekly_...
1 Treatment type Binary treatment, so IRM is appropriate for AT...
2 Estimands Both ATE and ATT-style targets are shown becau...
3 Control timing All controls are pre-treatment by construction...
4 Unconfoundedness Holds by construction here; requires design ev...
5 Overlap Propensity and weight diagnostics reported.
6 Nuisance learners Compared linear/logistic and gradient-boosted ...
7 Cross-fitting Manual AIPW and DoubleMLIRM use cross-fitted n...
8 Uncertainty Standard errors, confidence intervals, and boo...
9 Stability Repeated sample splitting and clipping sensiti...
10 Main limitation IRM does not solve hidden confounding or suppo...

The checklist forces the analysis to stay causal, not merely predictive. A polished propensity model is not a substitute for a credible assignment story.

Report Template

The next cell writes a short markdown report template using the main gradient-boosted IRM estimates.

This template is meant to be adapted. In real analyses, the synthetic truth checks would be replaced by design evidence, sensitivity analysis, and robustness checks.

main_ate = irm_summary.loc[irm_summary["estimator"] == "Gradient boosting nuisance IRM ATE"].iloc[0]
main_atte = irm_summary.loc[irm_summary["estimator"] == "Gradient boosting nuisance IRM ATTE"].iloc[0]
ess_value = dml_propensity_diagnostics.loc[
    dml_propensity_diagnostics["diagnostic"] == "IPW effective sample size", "value"
].iloc[0]

report_text = f"""
# IRM Effect Estimate Report Template

## Causal Question
Estimate the effect of `feature_exposure` on `weekly_value` using observed pre-treatment controls.

## Estimands
Two targets are reported:
- ATE: average effect over the full analysis population.
- ATTE: effect for the treated units under the DoubleML treated-target score.

## Main Estimates
ATE estimate:
- Estimated effect: {main_ate['theta_hat']:.4f}
- Standard error: {main_ate['std_error']:.4f}
- 95 percent confidence interval: [{main_ate['ci_95_lower']:.4f}, {main_ate['ci_95_upper']:.4f}]

ATTE estimate:
- Estimated effect: {main_atte['theta_hat']:.4f}
- Standard error: {main_atte['std_error']:.4f}
- 95 percent confidence interval: [{main_atte['ci_95_lower']:.4f}, {main_atte['ci_95_upper']:.4f}]

## Estimator
The main estimator is `DoubleMLIRM` with five-fold cross-fitting, histogram gradient-boosted outcome models, and a histogram gradient-boosted propensity classifier.

## Diagnostics Included
- Difference-in-means, OLS adjustment, manual IPW, manual AIPW, and DoubleML comparisons.
- Propensity overlap and inverse-propensity weight diagnostics.
- Effective sample size from IPW weights: {ess_value:.1f}.
- Outcome nuisance and propensity nuisance quality checks.
- Orthogonal score contribution checks.
- Bootstrap confidence intervals.
- Repeated sample-splitting stability.
- Propensity clipping sensitivity.
- Overlap stress test.

## Required Assumptions
The estimates rely on consistency, conditional unconfoundedness given the selected controls, overlap, and no interference. DoubleML improves estimation under these assumptions but does not establish them.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_irm_report_template.md"
report_path.write_text(report_text)
print(report_text)
# IRM Effect Estimate Report Template

## Causal Question
Estimate the effect of `feature_exposure` on `weekly_value` using observed pre-treatment controls.

## Estimands
Two targets are reported:
- ATE: average effect over the full analysis population.
- ATTE: effect for the treated units under the DoubleML treated-target score.

## Main Estimates
ATE estimate:
- Estimated effect: 0.9853
- Standard error: 0.0539
- 95 percent confidence interval: [0.8797, 1.0909]

ATTE estimate:
- Estimated effect: 1.0483
- Standard error: 0.0699
- 95 percent confidence interval: [0.9113, 1.1852]

## Estimator
The main estimator is `DoubleMLIRM` with five-fold cross-fitting, histogram gradient-boosted outcome models, and a histogram gradient-boosted propensity classifier.

## Diagnostics Included
- Difference-in-means, OLS adjustment, manual IPW, manual AIPW, and DoubleML comparisons.
- Propensity overlap and inverse-propensity weight diagnostics.
- Effective sample size from IPW weights: 2225.4.
- Outcome nuisance and propensity nuisance quality checks.
- Orthogonal score contribution checks.
- Bootstrap confidence intervals.
- Repeated sample-splitting stability.
- Propensity clipping sensitivity.
- Overlap stress test.

## Required Assumptions
The estimates rely on consistency, conditional unconfoundedness given the selected controls, overlap, and no interference. DoubleML improves estimation under these assumptions but does not establish them.

The report template keeps the estimands, assumptions, overlap diagnostics, and uncertainty attached to the numeric estimates. That is the right shape for binary-treatment causal reporting.

Artifact Manifest

The final cell lists every artifact produced by this notebook so the outputs are easy to find later.

artifact_rows = []
# Collect every file saved with this notebook's prefix across the output folders
for folder, artifact_type in [
    (DATASET_DIR, "dataset"),
    (TABLE_DIR, "table"),
    (FIGURE_DIR, "figure"),
    (REPORT_DIR, "report"),
]:
    for file_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": file_path.name,
                "relative_path": str(file_path.relative_to(REPO_ROOT)),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
save_table(artifact_manifest, "artifact_manifest")
display(artifact_manifest)
artifact_type file_name relative_path
0 dataset 05_synthetic_irm_data.csv notebooks/tutorials/doubleml/outputs/datasets/...
1 table 05_all_estimator_comparison.csv notebooks/tutorials/doubleml/outputs/tables/05...
2 table 05_baseline_estimates.csv notebooks/tutorials/doubleml/outputs/tables/05...
3 table 05_data_audit.csv notebooks/tutorials/doubleml/outputs/tables/05...
4 table 05_design_correlation_matrix.csv notebooks/tutorials/doubleml/outputs/tables/05...
5 table 05_doubleml_irm_summary.csv notebooks/tutorials/doubleml/outputs/tables/05...
6 table 05_doubleml_nuisance_losses.csv notebooks/tutorials/doubleml/outputs/tables/05...
7 table 05_doubleml_propensity_diagnostics.csv notebooks/tutorials/doubleml/outputs/tables/05...
8 table 05_doublemldata_backend_summary.csv notebooks/tutorials/doubleml/outputs/tables/05...
9 table 05_field_dictionary.csv notebooks/tutorials/doubleml/outputs/tables/05...
10 table 05_hgb_bootstrap_confidence_intervals.csv notebooks/tutorials/doubleml/outputs/tables/05...
11 table 05_hgb_nuisance_quality_against_truth.csv notebooks/tutorials/doubleml/outputs/tables/05...
12 table 05_irm_reporting_checklist.csv notebooks/tutorials/doubleml/outputs/tables/05...
13 table 05_learner_catalog.csv notebooks/tutorials/doubleml/outputs/tables/05...
14 table 05_manual_crossfit_estimates.csv notebooks/tutorials/doubleml/outputs/tables/05...
15 table 05_manual_nuisance_quality.csv notebooks/tutorials/doubleml/outputs/tables/05...
16 table 05_manual_propensity_diagnostics.csv notebooks/tutorials/doubleml/outputs/tables/05...
17 table 05_orthogonal_score_summary.csv notebooks/tutorials/doubleml/outputs/tables/05...
18 table 05_overlap_stress_test.csv notebooks/tutorials/doubleml/outputs/tables/05...
19 table 05_package_versions.csv notebooks/tutorials/doubleml/outputs/tables/05...
20 table 05_propensity_clipping_sensitivity.csv notebooks/tutorials/doubleml/outputs/tables/05...
21 table 05_repeated_sample_split_stability.csv notebooks/tutorials/doubleml/outputs/tables/05...
22 table 05_target_summary.csv notebooks/tutorials/doubleml/outputs/tables/05...
23 figure 05_baseline_estimates.png notebooks/tutorials/doubleml/outputs/figures/0...
24 figure 05_design_correlation_matrix.png notebooks/tutorials/doubleml/outputs/figures/0...
25 figure 05_doubleml_nuisance_prediction_quality.png notebooks/tutorials/doubleml/outputs/figures/0...
26 figure 05_irm_design_dag.png notebooks/tutorials/doubleml/outputs/figures/0...
27 figure 05_irm_estimate_comparison.png notebooks/tutorials/doubleml/outputs/figures/0...
28 figure 05_manual_outcome_nuisance_quality.png notebooks/tutorials/doubleml/outputs/figures/0...
29 figure 05_orthogonal_score_distribution.png notebooks/tutorials/doubleml/outputs/figures/0...
30 figure 05_overlap_stress_test.png notebooks/tutorials/doubleml/outputs/figures/0...
31 figure 05_propensity_and_effect_heterogeneity.png notebooks/tutorials/doubleml/outputs/figures/0...
32 figure 05_propensity_overlap_and_weights.png notebooks/tutorials/doubleml/outputs/figures/0...
33 figure 05_repeated_split_stability.png notebooks/tutorials/doubleml/outputs/figures/0...
34 report 05_irm_report_template.md notebooks/tutorials/doubleml/outputs/reports/0...

The IRM notebook is complete. The next natural topic is the interactive IV model for binary treatments with instruments and compliance-style reasoning.