DoubleML Tutorial 18: End-To-End DoubleML Case Study

This notebook closes the DoubleML tutorial series with a full case study. The goal is not to introduce one more estimator in isolation. The goal is to show how a complete DoubleML analysis should read from start to finish: causal question, estimand, data roles, estimator choice, learners, diagnostics, uncertainty, sensitivity checks, subgroup exploration, and a short final report.

The case study asks a generic product analytics question:

What is the causal effect of receiving a guided onboarding nudge on next-period user value?

The treatment is binary: a user either receives the nudge or does not. The outcome is continuous: a next-period value index. Because treatment is observational rather than randomized, treated and untreated users differ in baseline intent, engagement, tenure, support needs, and other pre-treatment signals.

The target estimand is the average treatment effect:

\[ \theta_0 = \mathbb{E}\left[Y(1) - Y(0)\right], \]

where \(Y(1)\) is the potential outcome if a user receives the nudge and \(Y(0)\) is the potential outcome if the same user does not. Identification requires the usual observational assumptions:

\[ \left(Y(1), Y(0)\right) \perp D \mid X, \]

\[ 0 < \mathbb{P}(D = 1 \mid X) < 1, \]

and stable measurement of treatment, outcome, and controls. The first condition says the observed pre-treatment controls \(X\) are sufficient for adjustment. The second says treated and untreated users overlap at comparable covariate values.

For a binary treatment, DoubleML’s interactive regression model estimates two outcome nuisance functions and one propensity nuisance function:

\[ g_0(0, X) = \mathbb{E}[Y \mid D = 0, X], \qquad g_0(1, X) = \mathbb{E}[Y \mid D = 1, X], \]

\[ m_0(X) = \mathbb{P}(D = 1 \mid X). \]

The doubly robust score for the ATE can be written as:

\[ \phi_i = \hat{g}(1, X_i) - \hat{g}(0, X_i) + \frac{D_i\left(Y_i - \hat{g}(1, X_i)\right)}{\hat{m}(X_i)} - \frac{(1-D_i)\left(Y_i - \hat{g}(0, X_i)\right)}{1-\hat{m}(X_i)}. \]

The estimate is the average of this orthogonal signal:

\[ \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\phi_i. \]
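
The same orthogonal signal also carries the uncertainty. Because \(\hat{\theta}\) is a sample mean of the \(\phi_i\), a plug-in standard error follows directly (a standard sketch; DoubleML's reported value may differ slightly depending on its internal score options):

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\phi_i - \hat{\theta}\right)^2, \qquad \widehat{\mathrm{se}}(\hat{\theta}) = \frac{\hat{\sigma}}{\sqrt{n}}. \]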

The strength of this workflow is not that it makes observational data magically causal. Its strength is that it gives us a principled estimate after we have made the design assumptions explicit and checked the numerical risks.

Setup

This setup cell imports the full case-study stack and creates the shared tutorial output folders. The repository-root detection keeps paths correct whether the notebook is executed from the repo root or directly from the notebook directory.

from pathlib import Path
import os
import warnings

PROJECT_ROOT = Path.cwd().resolve()
while PROJECT_ROOT != PROJECT_ROOT.parent and not (PROJECT_ROOT / "pyproject.toml").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

if not (PROJECT_ROOT / "pyproject.toml").exists():
    raise FileNotFoundError("Could not locate pyproject.toml; run this notebook from inside the repository.")

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MPLCONFIG_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MPLCONFIG_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ["MPLCONFIGDIR"] = str(MPLCONFIG_DIR)
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="The estimated nu2 .*", category=UserWarning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import doubleml as dml
from doubleml.utils.propensity_score_processing import PSProcessorConfig

from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
from scipy.special import expit
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sns.set_theme(style="whitegrid", context="talk")

RANDOM_SEED = 1818
NOTEBOOK_PREFIX = "18"
TRUE_TARGET_NAME = "true_ate"

print(f"DoubleML version: {dml.__version__}")
print(f"Writing artifacts under: {OUTPUT_DIR}")
DoubleML version: 0.11.2
Writing artifacts under: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs

The setup prints the DoubleML version and output root. Every generated file uses prefix 18, which makes the final artifact manifest easy to audit.

Helper Functions

The helper functions below are deliberately practical. A real analysis usually repeats the same actions many times: saving tables, fitting a model, extracting predictions, computing overlap diagnostics, building a doubly robust signal, and producing a compact estimate row.

def save_table(df, filename):
    path = TABLE_DIR / filename
    df.to_csv(path, index=False)
    return path


def save_dataset(df, filename):
    path = DATASET_DIR / filename
    df.to_csv(path, index=False)
    return path


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def make_kfold_splits(n_obs, seed, n_splits=5):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [(train_idx, test_idx) for train_idx, test_idx in kfold.split(np.arange(n_obs))]


def extract_prediction(model, learner_name):
    return model.predictions[learner_name][:, 0, 0]


def fit_irm(df, x_cols, learner_g, learner_m, label, sample_splits=None, n_folds=5):
    data = dml.DoubleMLData(df, y_col="next_period_value", d_cols="onboarding_nudge", x_cols=x_cols)
    ps_config = PSProcessorConfig(clipping_threshold=0.02)
    model = dml.DoubleMLIRM(
        data,
        ml_g=clone(learner_g),
        ml_m=clone(learner_m),
        n_folds=n_folds,
        score="ATE",
        normalize_ipw=True,
        ps_processor_config=ps_config,
    )

    if sample_splits is not None:
        model.set_sample_splitting(sample_splits)

    model.fit(store_predictions=True)
    ci = model.confint(level=0.95).iloc[0]
    losses = model.evaluate_learners()

    e_hat = np.clip(extract_prediction(model, "ml_m"), 0.02, 0.98)
    mu0_hat = extract_prediction(model, "ml_g0")
    mu1_hat = extract_prediction(model, "ml_g1")

    row = {
        "model": label,
        "estimate": float(model.coef[0]),
        "std_error": float(model.se[0]),
        "ci_95_lower": float(ci["2.5 %"]),
        "ci_95_upper": float(ci["97.5 %"]),
        "true_ate": float(df["true_treatment_effect"].mean()),
        "absolute_error": abs(float(model.coef[0]) - float(df["true_treatment_effect"].mean())),
        "ml_g0_rmse": float(losses["ml_g0"][0][0]),
        "ml_g1_rmse": float(losses["ml_g1"][0][0]),
        "ml_m_log_loss": float(losses["ml_m"][0][0]),
        "propensity_min": float(e_hat.min()),
        "propensity_p05": float(np.quantile(e_hat, 0.05)),
        "propensity_median": float(np.median(e_hat)),
        "propensity_p95": float(np.quantile(e_hat, 0.95)),
        "propensity_max": float(e_hat.max()),
    }
    predictions = pd.DataFrame(
        {
            "mu0_hat": mu0_hat,
            "mu1_hat": mu1_hat,
            "propensity_hat": e_hat,
        },
        index=df.index,
    )
    return model, row, predictions


def doubly_robust_signal(df, predictions):
    y = df["next_period_value"].to_numpy()
    d = df["onboarding_nudge"].to_numpy()
    mu0 = predictions["mu0_hat"].to_numpy()
    mu1 = predictions["mu1_hat"].to_numpy()
    e = np.clip(predictions["propensity_hat"].to_numpy(), 0.02, 0.98)
    signal = mu1 - mu0 + d * (y - mu1) / e - (1 - d) * (y - mu0) / (1 - e)
    return signal

These helpers do not hide the analysis. They standardize repeated mechanical work so the notebook can focus on design choices and diagnostics. The doubly_robust_signal() helper mirrors the ATE score described in the introduction.

Case Study Roadmap

The roadmap table turns the rest of the notebook into a review checklist. End-to-end causal work should be readable in this order: design first, estimation second, diagnostics third, reporting last.

roadmap = pd.DataFrame(
    [
        {"stage": "Causal question", "main_question": "What treatment, outcome, population, and estimand are we studying?"},
        {"stage": "Data design", "main_question": "Which variables are pre-treatment controls, treatment, outcome, and invalid post-treatment signals?"},
        {"stage": "Estimator choice", "main_question": "Why is IRM appropriate for a binary treatment?"},
        {"stage": "Main estimate", "main_question": "What does DoubleML estimate after cross-fitting nuisance functions?"},
        {"stage": "Diagnostics", "main_question": "Do propensity scores overlap, do nuisance models behave plausibly, and are results split-stable?"},
        {"stage": "Heterogeneity", "main_question": "Which user segments appear to benefit more, and how should that be reported cautiously?"},
        {"stage": "Sensitivity", "main_question": "How strong would hidden confounding need to be to threaten the conclusion?"},
        {"stage": "Report", "main_question": "What evidence, assumptions, and limitations should a reader see?"},
    ]
)

save_table(roadmap, f"{NOTEBOOK_PREFIX}_case_study_roadmap.csv")
display(roadmap)
stage main_question
0 Causal question What treatment, outcome, population, and estim...
1 Data design Which variables are pre-treatment controls, tr...
2 Estimator choice Why is IRM appropriate for a binary treatment?
3 Main estimate What does DoubleML estimate after cross-fittin...
4 Diagnostics Do propensity scores overlap, do nuisance mode...
5 Heterogeneity Which user segments appear to benefit more, an...
6 Sensitivity How strong would hidden confounding need to be...
7 Report What evidence, assumptions, and limitations sh...

The roadmap gives the notebook a story. The key habit is that the causal question and variable timing come before model fitting.

Simulate The Case Study Data

We simulate data with known truth so students can see when the workflow succeeds and where it remains vulnerable. The observed controls are all measured before treatment. The treatment is onboarding_nudge, and the outcome is next_period_value.

Treatment assignment is observational:

\[ D_i \sim \text{Bernoulli}(m_0(X_i)). \]

The outcome is generated from baseline value plus a heterogeneous treatment effect:

\[ Y_i = b_0(X_i) + \tau(X_i)D_i + \varepsilon_i. \]

The true ATE is:

\[ \theta_0 = \mathbb{E}[\tau(X_i)]. \]

A real case study would not know \(\tau(X_i)\) or \(m_0(X_i)\). Here they are retained only for teaching and validation.

rng = np.random.default_rng(RANDOM_SEED)
n_obs = 3_200

engagement_score = rng.normal(0, 1, n_obs)
intent_score = rng.normal(0, 1, n_obs)
content_breadth = rng.normal(0, 1, n_obs)
price_sensitivity = rng.normal(0, 1, n_obs)
mobile_share = rng.beta(4, 3, n_obs)
weekend_share = rng.beta(3, 5, n_obs)
account_age_weeks = rng.gamma(shape=2.4, scale=8.0, size=n_obs)
new_user = (account_age_weeks < np.quantile(account_age_weeks, 0.30)).astype(int)
email_opt_in = rng.binomial(1, expit(0.25 + 0.4 * engagement_score - 0.2 * price_sensitivity), n_obs)
support_contacts = rng.poisson(np.exp(-0.2 + 0.25 * price_sensitivity - 0.20 * engagement_score))

propensity_logit = (
    -0.35
    + 0.65 * intent_score
    + 0.35 * new_user
    + 0.30 * email_opt_in
    - 0.35 * price_sensitivity
    + 0.18 * mobile_share
    - 0.12 * support_contacts
    + 0.10 * content_breadth
)
true_propensity = np.clip(expit(propensity_logit), 0.05, 0.95)
onboarding_nudge = rng.binomial(1, true_propensity, n_obs)

true_treatment_effect = (
    0.34
    + 0.14 * intent_score
    + 0.11 * new_user
    - 0.06 * price_sensitivity
    + 0.05 * email_opt_in
    + 0.04 * mobile_share
)

baseline_value = (
    2.20
    + 0.70 * engagement_score
    + 0.45 * intent_score
    + 0.25 * content_breadth
    + 0.12 * np.sqrt(account_age_weeks)
    - 0.30 * price_sensitivity
    - 0.20 * support_contacts
    + 0.25 * email_opt_in
    + 0.10 * np.sin(2 * np.pi * weekend_share)
)

next_period_value = baseline_value + true_treatment_effect * onboarding_nudge + rng.normal(0, 0.90, n_obs)

# This intentionally invalid variable is created after treatment and outcome realization.
post_treatment_activity_proxy = 0.45 * next_period_value + 0.25 * onboarding_nudge + rng.normal(0, 0.55, n_obs)

case_df = pd.DataFrame(
    {
        "engagement_score": engagement_score,
        "intent_score": intent_score,
        "content_breadth": content_breadth,
        "price_sensitivity": price_sensitivity,
        "mobile_share": mobile_share,
        "weekend_share": weekend_share,
        "account_age_weeks": account_age_weeks,
        "new_user": new_user,
        "email_opt_in": email_opt_in,
        "support_contacts": support_contacts,
        "onboarding_nudge": onboarding_nudge,
        "next_period_value": next_period_value,
        "post_treatment_activity_proxy": post_treatment_activity_proxy,
        "true_propensity": true_propensity,
        "true_treatment_effect": true_treatment_effect,
    }
)

case_df["intent_segment"] = pd.qcut(
    case_df["intent_score"],
    q=3,
    labels=["low_intent", "mid_intent", "high_intent"],
)

x_cols = [
    "engagement_score",
    "intent_score",
    "content_breadth",
    "price_sensitivity",
    "mobile_share",
    "weekend_share",
    "account_age_weeks",
    "new_user",
    "email_opt_in",
    "support_contacts",
]

save_dataset(case_df, f"{NOTEBOOK_PREFIX}_end_to_end_case_study_data.csv")
case_df.head()
engagement_score intent_score content_breadth price_sensitivity mobile_share weekend_share account_age_weeks new_user email_opt_in support_contacts onboarding_nudge next_period_value post_treatment_activity_proxy true_propensity true_treatment_effect intent_segment
0 0.694589 -0.187952 1.302988 -0.089980 0.599224 0.547231 16.345623 0 0 1 0 3.805909 1.461115 0.420063 0.343054 mid_intent
1 0.589058 0.868203 -0.995821 0.703798 0.721945 0.301545 15.095610 0 0 0 1 3.430184 2.047801 0.499593 0.448198 high_intent
2 0.838012 0.042641 0.390679 2.390325 0.602758 0.326320 7.513168 1 0 2 0 1.888274 0.165999 0.288777 0.336661 mid_intent
3 1.556491 1.146182 -0.830658 0.484532 0.656579 0.772629 8.053153 1 1 0 0 2.979645 1.295039 0.713113 0.657657 high_intent
4 -1.094204 -2.179274 -1.406732 -0.702787 0.434521 0.303187 13.355840 0 1 0 0 0.308343 0.135570 0.217038 0.144450 low_intent

The first rows include the teaching-only columns true_propensity and true_treatment_effect. Those columns are not included in the DoubleML feature set. They are kept so we can compare the estimated ATE to the known truth.
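
As a rough sanity check on the simulated truth: intent_score and price_sensitivity are mean-zero, new_user has mean 0.30 by construction, mobile_share has mean 4/7, and roughly 56% of users end up opted in, so

\[ \theta_0 \approx 0.34 + 0.11 \times 0.30 + 0.05 \times 0.56 + 0.04 \times 0.57 \approx 0.42, \]

which lines up with the realized true ATE of about 0.426 reported in the data audit below.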

Field Dictionary

This field dictionary documents the timing and role of every column. In an observational analysis, this table is not decorative. It is the difference between valid adjustment and accidental leakage.

field_dictionary = pd.DataFrame(
    [
        {"field": "engagement_score", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline engagement before nudge eligibility."},
        {"field": "intent_score", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline intent or product-fit signal."},
        {"field": "content_breadth", "role": "pre-treatment control", "used_in_model": True, "description": "Breadth of prior usage across app areas."},
        {"field": "price_sensitivity", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline sensitivity or friction signal."},
        {"field": "mobile_share", "role": "pre-treatment control", "used_in_model": True, "description": "Share of prior activity on mobile."},
        {"field": "weekend_share", "role": "pre-treatment control", "used_in_model": True, "description": "Share of prior activity occurring on weekends."},
        {"field": "account_age_weeks", "role": "pre-treatment control", "used_in_model": True, "description": "Tenure before treatment eligibility."},
        {"field": "new_user", "role": "pre-treatment control", "used_in_model": True, "description": "Indicator derived from account age."},
        {"field": "email_opt_in", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline communication eligibility."},
        {"field": "support_contacts", "role": "pre-treatment control", "used_in_model": True, "description": "Prior support-contact count."},
        {"field": "onboarding_nudge", "role": "treatment", "used_in_model": "treatment column", "description": "Binary exposure to the guided onboarding nudge."},
        {"field": "next_period_value", "role": "outcome", "used_in_model": "outcome column", "description": "Continuous value index after treatment."},
        {"field": "intent_segment", "role": "reporting segment", "used_in_model": False, "description": "Segment used for subgroup summaries."},
        {"field": "post_treatment_activity_proxy", "role": "post-treatment leakage", "used_in_model": False, "description": "Invalid control retained only as a warning example."},
        {"field": "true_propensity", "role": "oracle teaching column", "used_in_model": False, "description": "Known treatment probability from simulation."},
        {"field": "true_treatment_effect", "role": "oracle teaching column", "used_in_model": False, "description": "Known individual treatment effect from simulation."},
    ]
)

save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
display(field_dictionary)
field role used_in_model description
0 engagement_score pre-treatment control True Baseline engagement before nudge eligibility.
1 intent_score pre-treatment control True Baseline intent or product-fit signal.
2 content_breadth pre-treatment control True Breadth of prior usage across app areas.
3 price_sensitivity pre-treatment control True Baseline sensitivity or friction signal.
4 mobile_share pre-treatment control True Share of prior activity on mobile.
5 weekend_share pre-treatment control True Share of prior activity occurring on weekends.
6 account_age_weeks pre-treatment control True Tenure before treatment eligibility.
7 new_user pre-treatment control True Indicator derived from account age.
8 email_opt_in pre-treatment control True Baseline communication eligibility.
9 support_contacts pre-treatment control True Prior support-contact count.
10 onboarding_nudge treatment treatment column Binary exposure to the guided onboarding nudge.
11 next_period_value outcome outcome column Continuous value index after treatment.
12 intent_segment reporting segment False Segment used for subgroup summaries.
13 post_treatment_activity_proxy post-treatment leakage False Invalid control retained only as a warning exa...
14 true_propensity oracle teaching column False Known treatment probability from simulation.
15 true_treatment_effect oracle teaching column False Known individual treatment effect from simulat...

The feature set includes only pre-treatment controls. The post-treatment proxy is intentionally excluded. A real case study should include this kind of timing table before any model results.
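
A cheap guard worth automating: derive the allowed feature set from the field dictionary and assert that no flagged column leaks into the controls. This is a minimal sketch against the field_dictionary and x_cols objects defined above.

# Fields the dictionary marks as valid model features (used_in_model is True).
dictionary_features = set(
    field_dictionary.loc[field_dictionary["used_in_model"].eq(True), "field"]
)

# The list passed to DoubleML must match the documented pre-treatment set exactly.
assert set(x_cols) == dictionary_features, "x_cols and the field dictionary disagree"

# Columns with flagged roles must never appear among the controls.
forbidden_roles = {"post-treatment leakage", "oracle teaching column", "treatment", "outcome"}
forbidden_fields = set(
    field_dictionary.loc[field_dictionary["role"].isin(forbidden_roles), "field"]
)
assert not (forbidden_fields & set(x_cols)), "flagged column found in the feature set"
print("Feature set matches the field dictionary; no flagged columns leak in.")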

Data Audit

The data audit checks sample size, treatment rate, outcome scale, known true ATE, and propensity overlap from the simulator. In a real observational dataset, the oracle columns would not be available, but the same audit structure still applies.

data_audit = pd.DataFrame(
    [
        {"metric": "rows", "value": len(case_df)},
        {"metric": "treatment_rate", "value": case_df["onboarding_nudge"].mean()},
        {"metric": "outcome_mean", "value": case_df["next_period_value"].mean()},
        {"metric": "outcome_sd", "value": case_df["next_period_value"].std()},
        {"metric": "true_ate", "value": case_df["true_treatment_effect"].mean()},
        {"metric": "true_propensity_p05", "value": case_df["true_propensity"].quantile(0.05)},
        {"metric": "true_propensity_p95", "value": case_df["true_propensity"].quantile(0.95)},
        {"metric": "pre_treatment_controls", "value": len(x_cols)},
    ]
)

save_table(data_audit, f"{NOTEBOOK_PREFIX}_data_audit.csv")
display(data_audit)
metric value
0 rows 3200.000000
1 treatment_rate 0.463125
2 outcome_mean 2.917136
3 outcome_sd 1.451041
4 true_ate 0.425708
5 true_propensity_p05 0.202593
6 true_propensity_p95 0.777225
7 pre_treatment_controls 10.000000

The treatment rate and propensity quantiles suggest meaningful support in both the treated and untreated groups. This does not prove overlap after flexible estimation, so we will inspect estimated propensities later.
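
A complementary audit that needs no oracle columns is a covariate balance check. The sketch below computes standardized mean differences between treated and untreated users for each control; large magnitudes (a common rule of thumb flags values above roughly 0.1) identify the covariates that carry most of the confounding and therefore most of the adjustment burden.

treated_rows = case_df[case_df["onboarding_nudge"].eq(1)]
control_rows = case_df[case_df["onboarding_nudge"].eq(0)]

balance_check = pd.DataFrame(
    {
        "covariate": x_cols,
        "standardized_mean_diff": [
            (treated_rows[col].mean() - control_rows[col].mean())
            / np.sqrt(0.5 * (treated_rows[col].var() + control_rows[col].var()))
            for col in x_cols
        ],
    }
).sort_values("standardized_mean_diff", key=np.abs, ascending=False)

display(balance_check)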

Design Diagram

The diagram below focuses only on the causal design. The nuisance models used by DoubleML are introduced later in the estimation section. Keeping the graph causal makes the arrows easier to read: baseline controls affect both nudge assignment and future value, the nudge affects future value, the baseline profile can modify the treatment effect, and dashed arrows mark unmeasured-confounding risk.

nodes = {
    "X": {"xy": (0.16, 0.66), "label": "Observed baseline\ncontrols X", "color": "#dbeafe"},
    "T": {"xy": (0.43, 0.82), "label": "Effect variation\ntau(X)", "color": "#dcfce7"},
    "D": {"xy": (0.43, 0.38), "label": "Onboarding\nnudge D", "color": "#fef3c7"},
    "Y": {"xy": (0.75, 0.58), "label": "Next-period\nvalue Y", "color": "#fee2e2"},
    "U": {"xy": (0.75, 0.24), "label": "Unmeasured\nrisk", "color": "#f3f4f6"},
}

fig, ax = plt.subplots(figsize=(12, 6.2))
ax.set_axis_off()
ax.set_xlim(0.00, 1.00)
ax.set_ylim(0.00, 1.00)
box_w, box_h = 0.15, 0.12


def anchor(node, side):
    x, y = nodes[node]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
        "upper_right": (box_w / 2, box_h * 0.25),
        "lower_right": (box_w / 2, -box_h * 0.25),
        "upper_left": (-box_w / 2, box_h * 0.25),
        "lower_left": (-box_w / 2, -box_h * 0.25),
    }
    dx, dy = offsets[side]
    return np.array([x + dx, y + dy], dtype=float)


def shorten(start, end, gap=0.038):
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    delta = end - start
    length = np.hypot(delta[0], delta[1])
    if length == 0:
        return tuple(start), tuple(end)
    unit = delta / length
    return tuple(start + gap * unit), tuple(end - gap * unit)


def draw_arrow(start, end, color="#334155", style="solid", rad=0.0, linewidth=1.7):
    start, end = shorten(start, end)
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=linewidth,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        zorder=2,
        clip_on=False,
    )
    ax.add_patch(arrow)

# Observed causal design paths.
draw_arrow(anchor("X", "lower_right"), anchor("D", "upper_left"), color="#334155")
draw_arrow(anchor("X", "right"), anchor("Y", "left"), color="#334155", rad=0.04)
draw_arrow(anchor("D", "right"), anchor("Y", "lower_left"), color="#b45309")

# Effect-modification path: X determines who benefits more; tau(X) changes the treatment response.
draw_arrow(anchor("X", "upper_right"), anchor("T", "left"), color="#15803d")
draw_arrow(anchor("T", "right"), anchor("Y", "upper_left"), color="#15803d", rad=-0.03)

# Dashed paths show what would threaten the observed-design claim.
draw_arrow(anchor("U", "upper_left"), anchor("D", "lower_right"), color="#6b7280", style="dashed", rad=-0.02, linewidth=1.4)
draw_arrow(anchor("U", "top"), anchor("Y", "bottom"), color="#6b7280", style="dashed", linewidth=1.4)

for spec in nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

ax.text(
    0.50,
    0.08,
    "Solid arrows show the observed causal design; green arrows show effect variation; dashed arrows show hidden-confounding risk.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("End-To-End DoubleML Case Study Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_case_study_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The dashed arrows represent residual hidden-confounding risk. DoubleML can estimate an orthogonal score for the observed design, but it cannot verify that all relevant confounders are measured.

Naive And Adjusted Baselines

Before fitting DoubleML, we compute two simple baselines. The naive difference in means ignores confounding. The adjusted linear regression controls for the same pre-treatment features but uses a simple linear outcome model. These are comparison points, not the final estimator.

treated_mean = case_df.loc[case_df["onboarding_nudge"].eq(1), "next_period_value"].mean()
control_mean = case_df.loc[case_df["onboarding_nudge"].eq(0), "next_period_value"].mean()
naive_diff = treated_mean - control_mean

ols_features = case_df[["onboarding_nudge"] + x_cols]
ols_model = LinearRegression().fit(ols_features, case_df["next_period_value"])
ols_adjusted_effect = float(ols_model.coef_[0])

baseline_estimates = pd.DataFrame(
    [
        {"estimator": "Naive difference in means", "estimate": naive_diff, "true_ate": case_df["true_treatment_effect"].mean()},
        {"estimator": "Adjusted linear regression", "estimate": ols_adjusted_effect, "true_ate": case_df["true_treatment_effect"].mean()},
    ]
)
baseline_estimates["absolute_error"] = (baseline_estimates["estimate"] - baseline_estimates["true_ate"]).abs()

save_table(baseline_estimates, f"{NOTEBOOK_PREFIX}_baseline_estimates.csv")
display(baseline_estimates)
estimator estimate true_ate absolute_error
0 Naive difference in means 0.902143 0.425708 0.476436
1 Adjusted linear regression 0.412130 0.425708 0.013578

The naive estimate is expected to be biased because treatment assignment depends on baseline intent and other features. The adjusted linear regression is a stronger baseline, but it still imposes simple functional-form assumptions.
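
The retained post-treatment proxy makes the leakage warning concrete. A minimal demonstration, for illustration only: adding post_treatment_activity_proxy to the adjusted regression should pull the treatment coefficient well away from the truth, because the proxy is a downstream consequence of both the treatment and the outcome.

leaky_features = case_df[["onboarding_nudge"] + x_cols + ["post_treatment_activity_proxy"]]
leaky_model = LinearRegression().fit(leaky_features, case_df["next_period_value"])
print(
    f"Adjusted effect with leakage: {leaky_model.coef_[0]:.3f} "
    f"(true ATE: {case_df['true_treatment_effect'].mean():.3f})"
)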

Fit The Main DoubleML IRM Model

The main model uses DoubleMLIRM because the treatment is binary. We use random forests for both outcome nuisance functions and the propensity model. Cross-fitting ensures that the score contribution for each row uses nuisance predictions from models that did not train on that row.

main_outcome_learner = RandomForestRegressor(
    n_estimators=240,
    min_samples_leaf=12,
    max_features="sqrt",
    random_state=RANDOM_SEED,
    n_jobs=-1,
)
main_propensity_learner = RandomForestClassifier(
    n_estimators=240,
    min_samples_leaf=12,
    max_features="sqrt",
    random_state=RANDOM_SEED + 1,
    n_jobs=-1,
)

main_irm, main_row, main_predictions = fit_irm(
    case_df,
    x_cols=x_cols,
    learner_g=main_outcome_learner,
    learner_m=main_propensity_learner,
    label="Random forest IRM",
)

main_estimate = pd.DataFrame([main_row])
save_table(main_estimate, f"{NOTEBOOK_PREFIX}_main_irm_estimate.csv")
display(main_estimate)
model estimate std_error ci_95_lower ci_95_upper true_ate absolute_error ml_g0_rmse ml_g1_rmse ml_m_log_loss propensity_min propensity_p05 propensity_median propensity_p95 propensity_max
0 Random forest IRM 0.461497 0.037944 0.387127 0.535866 0.425708 0.035789 0.980687 0.980966 0.472507 0.09138 0.208811 0.460587 0.73166 0.868944

The main DoubleML estimate should be read with its confidence interval, not only the point estimate. Because this is a simulation, we can also compare it to the true ATE. In real data, that column would be unavailable.
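
A quick consistency check ties the fitted model back to the score from the introduction. The mean of the doubly robust signal should sit close to the reported coefficient, though not match it exactly, because the model normalizes its IPW terms internally and the helper re-clips the propensities.

phi = doubly_robust_signal(case_df, main_predictions)
print(f"Signal mean: {phi.mean():.4f} vs model coefficient: {main_row['estimate']:.4f}")
print(f"Signal-based SE: {phi.std(ddof=1) / np.sqrt(len(phi)):.4f} vs model SE: {main_row['std_error']:.4f}")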

Main Estimate Plot

The next plot compares the naive baseline, adjusted linear regression, and main DoubleML estimate against the known true ATE. This is a teaching plot: in real applications, the vertical line at the true ATE would be replaced by a design-based benchmark or omitted entirely.

main_plot_df = pd.concat(
    [
        baseline_estimates.rename(columns={"estimator": "model"})[["model", "estimate", "true_ate"]],
        main_estimate[["model", "estimate", "true_ate", "ci_95_lower", "ci_95_upper"]],
    ],
    ignore_index=True,
)
main_plot_df["ci_95_lower"] = main_plot_df["ci_95_lower"].fillna(main_plot_df["estimate"])
main_plot_df["ci_95_upper"] = main_plot_df["ci_95_upper"].fillna(main_plot_df["estimate"])
main_plot_df["lower_error"] = main_plot_df["estimate"] - main_plot_df["ci_95_lower"]
main_plot_df["upper_error"] = main_plot_df["ci_95_upper"] - main_plot_df["estimate"]

fig, ax = plt.subplots(figsize=(10.5, 5.2))
ax.axvline(case_df["true_treatment_effect"].mean(), color="#111827", linestyle="--", linewidth=1.5, label="True ATE")
ax.errorbar(
    x=main_plot_df["estimate"],
    y=np.arange(len(main_plot_df)),
    xerr=[main_plot_df["lower_error"], main_plot_df["upper_error"]],
    fmt="o",
    color="#2563eb",
    ecolor="#64748b",
    capsize=4,
)
ax.set_yticks(np.arange(len(main_plot_df)))
ax.set_yticklabels(main_plot_df["model"])
ax.set_xlabel("Estimated ATE")
ax.set_ylabel("")
ax.set_title("Baseline And DoubleML Estimates")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_main_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The visual comparison shows why end-to-end analysis is useful. A flexible orthogonal estimator should move the answer toward the known causal target when the adjustment set is correct and overlap is reasonable.

Nuisance Diagnostics

IRM relies on outcome predictions for treated and untreated potential outcomes and a propensity prediction for treatment assignment. The diagnostics below summarize the held-out nuisance performance produced by cross-fitting.

nuisance_diagnostics = pd.DataFrame(
    [
        {"nuisance_model": "ml_g0", "target": "Outcome among untreated", "metric": "RMSE", "value": main_row["ml_g0_rmse"]},
        {"nuisance_model": "ml_g1", "target": "Outcome among treated", "metric": "RMSE", "value": main_row["ml_g1_rmse"]},
        {"nuisance_model": "ml_m", "target": "Propensity", "metric": "Log loss", "value": main_row["ml_m_log_loss"]},
    ]
)

save_table(nuisance_diagnostics, f"{NOTEBOOK_PREFIX}_main_nuisance_diagnostics.csv")
display(nuisance_diagnostics)
nuisance_model target metric value
0 ml_g0 Outcome among untreated RMSE 0.980687
1 ml_g1 Outcome among treated RMSE 0.980966
2 ml_m Propensity Log loss 0.472507

These diagnostics are numerical checks, not causal proof. Good nuisance performance supports the estimation workflow, but the identification assumptions still come from the study design.

Propensity Overlap Diagnostics

Overlap is central for binary-treatment causal inference. If estimated propensities are extremely close to zero or one, the doubly robust score can become unstable because inverse-propensity terms get large.
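
The instability is easy to quantify: a treated observation enters the score with weight \(1/\hat{m}(X)\), so every step toward zero multiplies the influence of a single user. A tiny illustration:

for e in [0.50, 0.10, 0.02, 0.005]:
    print(f"propensity {e:6.3f} -> treated-arm weight 1/e = {1 / e:6.0f}")

The 0.02 clipping threshold used in fit_irm caps this weight at 50, which bounds how much any single observation can move the estimate.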

propensity_diagnostics = pd.DataFrame(
    [
        {"metric": "min", "value": main_predictions["propensity_hat"].min()},
        {"metric": "p01", "value": main_predictions["propensity_hat"].quantile(0.01)},
        {"metric": "p05", "value": main_predictions["propensity_hat"].quantile(0.05)},
        {"metric": "median", "value": main_predictions["propensity_hat"].median()},
        {"metric": "p95", "value": main_predictions["propensity_hat"].quantile(0.95)},
        {"metric": "p99", "value": main_predictions["propensity_hat"].quantile(0.99)},
        {"metric": "max", "value": main_predictions["propensity_hat"].max()},
        {"metric": "share_below_0.05", "value": (main_predictions["propensity_hat"] < 0.05).mean()},
        {"metric": "share_above_0.95", "value": (main_predictions["propensity_hat"] > 0.95).mean()},
    ]
)

save_table(propensity_diagnostics, f"{NOTEBOOK_PREFIX}_propensity_diagnostics.csv")
display(propensity_diagnostics)
metric value
0 min 0.091380
1 p01 0.157453
2 p05 0.208811
3 median 0.460587
4 p95 0.731660
5 p99 0.806554
6 max 0.868944
7 share_below_0.05 0.000000
8 share_above_0.95 0.000000

The estimated propensity distribution has no mass near the extreme boundaries, so the 0.02 clipping rule barely binds. That makes the ATE estimate easier to trust numerically than a setting with many near-deterministic treatment assignments.

Plot Estimated Propensity Overlap

The histogram compares estimated propensity scores for treated and untreated users. We want enough overlap that both groups appear across the central support.

propensity_plot_df = case_df[["onboarding_nudge"]].join(main_predictions[["propensity_hat"]])
propensity_plot_df["group"] = np.where(propensity_plot_df["onboarding_nudge"].eq(1), "treated", "untreated")

fig, ax = plt.subplots(figsize=(10.5, 5.4))
sns.histplot(
    data=propensity_plot_df,
    x="propensity_hat",
    hue="group",
    bins=36,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    linewidth=2,
    ax=ax,
)
ax.axvline(0.02, color="#6b7280", linestyle="--", linewidth=1, label="clipping threshold")
ax.axvline(0.98, color="#6b7280", linestyle="--", linewidth=1)
ax.set_title("Estimated Propensity Overlap")
ax.set_xlabel(r"Estimated propensity $\hat{m}(X)$")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The treated and untreated distributions overlap in the middle of the propensity range. If one group appeared only near zero or one, the ATE would rely heavily on extrapolation.

Compare Learner Families

A complete case study should check whether the main result is a product of one learner choice. We compare three nuisance-model families: a linear/logistic baseline, random forests, and histogram gradient boosting.

learner_specs = {
    "Linear-logistic IRM": {
        "ml_g": make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=RANDOM_SEED, max_iter=10_000)),
        "ml_m": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2_000, solver="lbfgs")),
    },
    "Random forest IRM": {
        "ml_g": main_outcome_learner,
        "ml_m": main_propensity_learner,
    },
    "Histogram gradient boosting IRM": {
        "ml_g": HistGradientBoostingRegressor(max_iter=180, learning_rate=0.05, min_samples_leaf=25, random_state=RANDOM_SEED),
        "ml_m": HistGradientBoostingClassifier(max_iter=180, learning_rate=0.05, min_samples_leaf=25, random_state=RANDOM_SEED + 2),
    },
}

learner_rows = []
learner_models = {}
learner_predictions = {}
for label, spec in learner_specs.items():
    model, row, preds = fit_irm(
        case_df,
        x_cols=x_cols,
        learner_g=spec["ml_g"],
        learner_m=spec["ml_m"],
        label=label,
    )
    learner_models[label] = model
    learner_predictions[label] = preds
    learner_rows.append(row)

learner_comparison = pd.DataFrame(learner_rows).sort_values("absolute_error")
save_table(learner_comparison, f"{NOTEBOOK_PREFIX}_learner_family_comparison.csv")
display(learner_comparison)
model estimate std_error ci_95_lower ci_95_upper true_ate absolute_error ml_g0_rmse ml_g1_rmse ml_m_log_loss propensity_min propensity_p05 propensity_median propensity_p95 propensity_max
0 Linear-logistic IRM 0.410063 0.034676 0.342100 0.478026 0.425708 0.015645 0.892582 0.899940 0.466185 0.042807 0.176977 0.458474 0.774297 0.958462
1 Random forest IRM 0.457877 0.037706 0.383974 0.531780 0.425708 0.032170 0.978987 0.975234 0.471871 0.084552 0.211543 0.464042 0.729396 0.885520
2 Histogram gradient boosting IRM 0.333706 0.048376 0.238891 0.428521 0.425708 0.092002 0.971182 0.959456 0.492131 0.020000 0.106707 0.454504 0.848139 0.967526

The learner comparison should be read as robustness evidence. If all reasonable learner families tell a similar story, the result is less dependent on a single modeling choice. If they diverge, the report should slow down and explain why.

Plot Learner Family Estimates

The plot below compares point estimates and confidence intervals across nuisance learner choices. The true ATE line is available only because this is a teaching simulation.

learner_plot = learner_comparison.sort_values("estimate").copy()
learner_plot["lower_error"] = learner_plot["estimate"] - learner_plot["ci_95_lower"]
learner_plot["upper_error"] = learner_plot["ci_95_upper"] - learner_plot["estimate"]

fig, ax = plt.subplots(figsize=(10.5, 5.4))
ax.axvline(case_df["true_treatment_effect"].mean(), color="#111827", linestyle="--", linewidth=1.5, label="True ATE")
ax.errorbar(
    x=learner_plot["estimate"],
    y=np.arange(len(learner_plot)),
    xerr=[learner_plot["lower_error"], learner_plot["upper_error"]],
    fmt="o",
    color="#2563eb",
    ecolor="#64748b",
    capsize=4,
)
ax.set_yticks(np.arange(len(learner_plot)))
ax.set_yticklabels(learner_plot["model"])
ax.set_xlabel("Estimated ATE")
ax.set_ylabel("")
ax.set_title("DoubleML Estimates Across Learner Families")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_family_estimates.png", dpi=160, bbox_inches="tight")
plt.show()

The estimates are close enough to support a coherent main story. The learner table still belongs in the final report because it shows the result is not simply a random-forest artifact.

Doubly Robust Signal For Segment Diagnostics

The doubly robust signal is useful for exploratory subgroup summaries. Here we use it to ask whether high-intent users appear to benefit more from the nudge. This is a diagnostic use of the score signal, not a replacement for a formal policy-learning or GATE analysis.

case_df_with_signal = case_df.copy()
case_df_with_signal["dr_signal"] = doubly_robust_signal(case_df, main_predictions)

segment_summary = (
    case_df_with_signal.groupby("intent_segment", observed=True)
    .agg(
        n=("dr_signal", "size"),
        dr_ate=("dr_signal", "mean"),
        dr_se=("dr_signal", lambda x: x.std(ddof=1) / np.sqrt(len(x))),
        true_segment_ate=("true_treatment_effect", "mean"),
        treatment_rate=("onboarding_nudge", "mean"),
    )
    .reset_index()
)
segment_summary["ci_95_lower"] = segment_summary["dr_ate"] - 1.96 * segment_summary["dr_se"]
segment_summary["ci_95_upper"] = segment_summary["dr_ate"] + 1.96 * segment_summary["dr_se"]

save_table(segment_summary, f"{NOTEBOOK_PREFIX}_segment_dr_signal_summary.csv")
display(segment_summary)
intent_segment n dr_ate dr_se true_segment_ate treatment_rate ci_95_lower ci_95_upper
0 low_intent 1067 0.314353 0.072332 0.272389 0.303655 0.172582 0.456124
1 mid_intent 1066 0.508658 0.060451 0.425631 0.462477 0.390174 0.627142
2 high_intent 1067 0.560745 0.064020 0.579102 0.623243 0.435266 0.686223

The segment summary should move in the same direction as the known heterogeneous effect pattern. In real data, this would be a hypothesis-generating diagnostic unless accompanied by a formal heterogeneity design.
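
For a more formal follow-up, DoubleML exposes a group average treatment effect interface on the fitted IRM object. A minimal sketch, assuming the gate() method available in recent DoubleML releases; the grouping variable must be pre-treatment, which intent_segment satisfies because it is derived from baseline intent.

# Dummy-coded groups for the fitted main model (illustrative use of gate()).
segment_dummies = pd.get_dummies(case_df["intent_segment"], dtype=float)
gate_model = main_irm.gate(groups=segment_dummies)
display(gate_model.confint(level=0.95))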

Plot Segment Effects

The segment plot compares doubly robust subgroup means with the oracle segment effects from the simulator. The oracle points are included only for teaching.

fig, ax = plt.subplots(figsize=(10.5, 5.4))
positions = np.arange(len(segment_summary))
ax.errorbar(
    positions - 0.08,
    segment_summary["dr_ate"],
    yerr=1.96 * segment_summary["dr_se"],
    fmt="o",
    color="#2563eb",
    capsize=4,
    label="DR signal mean",
)
ax.scatter(
    positions + 0.08,
    segment_summary["true_segment_ate"],
    color="#b45309",
    marker="D",
    label="True segment ATE",
)
ax.set_xticks(positions)
ax.set_xticklabels(segment_summary["intent_segment"])
ax.set_ylabel("Estimated effect")
ax.set_xlabel("Intent segment")
ax.set_title("Exploratory Segment Effects From The Doubly Robust Signal")
ax.legend(loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_segment_effects.png", dpi=160, bbox_inches="tight")
plt.show()

The high-intent segment shows a larger effect, matching the way the teaching data were generated. The important reporting habit is to label this as subgroup evidence, not as an automatically deployable targeting rule.

Sample-Split Stability

Cross-fitting uses random folds. A single split can occasionally be lucky or unlucky. We refit the main random-forest specification over several deterministic fold seeds to check whether the estimate is stable.

stability_rows = []
stability_seeds = list(range(RANDOM_SEED, RANDOM_SEED + 8))

stability_outcome_learner = RandomForestRegressor(
    n_estimators=140,
    min_samples_leaf=14,
    max_features="sqrt",
    random_state=RANDOM_SEED,
    n_jobs=-1,
)
stability_propensity_learner = RandomForestClassifier(
    n_estimators=140,
    min_samples_leaf=14,
    max_features="sqrt",
    random_state=RANDOM_SEED + 1,
    n_jobs=-1,
)

for seed in stability_seeds:
    sample_splits = make_kfold_splits(len(case_df), seed=seed, n_splits=5)
    model, row, _ = fit_irm(
        case_df,
        x_cols=x_cols,
        learner_g=stability_outcome_learner,
        learner_m=stability_propensity_learner,
        label=f"split_seed_{seed}",
        sample_splits=sample_splits,
    )
    row["split_seed"] = seed
    stability_rows.append(row)

split_stability = pd.DataFrame(stability_rows)
split_stability_summary = pd.DataFrame(
    [
        {
            "n_repetitions": len(split_stability),
            "mean_estimate": split_stability["estimate"].mean(),
            "sd_estimate": split_stability["estimate"].std(ddof=1),
            "min_estimate": split_stability["estimate"].min(),
            "max_estimate": split_stability["estimate"].max(),
            "true_ate": case_df["true_treatment_effect"].mean(),
        }
    ]
)

save_table(split_stability, f"{NOTEBOOK_PREFIX}_sample_split_stability.csv")
save_table(split_stability_summary, f"{NOTEBOOK_PREFIX}_sample_split_stability_summary.csv")
display(split_stability_summary)
n_repetitions mean_estimate sd_estimate min_estimate max_estimate true_ate
0 8 0.462021 0.006578 0.450427 0.469819 0.425708

The repeated-split standard deviation is a practical stability check. If fold randomness changed the conclusion, the final report would need to say so and rely on repeated cross-fitting or a more stable design.
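
Repeated cross-fitting is also built into DoubleML directly. A minimal sketch, assuming the n_rep constructor argument, which refits over several independent fold partitions and aggregates the results (by median in current releases):

repeated_data = dml.DoubleMLData(
    case_df, y_col="next_period_value", d_cols="onboarding_nudge", x_cols=x_cols
)
repeated_irm = dml.DoubleMLIRM(
    repeated_data,
    ml_g=clone(stability_outcome_learner),
    ml_m=clone(stability_propensity_learner),
    n_folds=5,
    n_rep=5,  # five independent cross-fitting repetitions
    score="ATE",
)
# Propensity-processing options from fit_irm omitted for brevity.
repeated_irm.fit()
print(repeated_irm.summary)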

Plot Sample-Split Stability

This plot shows the repeated estimates against the main estimate and true ATE. It is a compact way to communicate whether cross-fitting randomness matters materially.

fig, ax = plt.subplots(figsize=(10.5, 5.2))
ax.axhline(case_df["true_treatment_effect"].mean(), color="#111827", linestyle="--", linewidth=1.5, label="True ATE")
ax.axhline(main_row["estimate"], color="#b45309", linestyle=":", linewidth=1.8, label="Main estimate")
sns.pointplot(
    data=split_stability,
    x="split_seed",
    y="estimate",
    color="#2563eb",
    errorbar=None,
    ax=ax,
)
ax.set_title("Sample-Split Stability For The Main IRM Specification")
ax.set_xlabel("Sample-split seed")
ax.set_ylabel("Estimated ATE")
ax.tick_params(axis="x", rotation=45)
ax.legend(loc="upper right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sample_split_stability.png", dpi=160, bbox_inches="tight")
plt.show()

The repeated estimates cluster around the same value. This supports the numerical stability of the main estimate under alternative fold assignments.

Sensitivity Analysis For Hidden Confounding

The final statistical check uses DoubleML’s sensitivity analysis. The parameters cf_y and cf_d describe the strength of an unobserved confounder in the outcome and treatment equations. The parameter rho sets the correlation between those two confounding components, which fixes the direction of the implied bias. These are not discovered from the data; they are stress-test settings.

def extract_sensitivity_row(model, scenario, cf_y, cf_d, rho):
    params = model.sensitivity_params
    return {
        "scenario": scenario,
        "cf_y": cf_y,
        "cf_d": cf_d,
        "rho": rho,
        "theta_lower": float(params["theta"]["lower"][0]),
        "theta": float(model.coef[0]),
        "theta_upper": float(params["theta"]["upper"][0]),
        "ci_lower": float(params["ci"]["lower"][0]),
        "ci_upper": float(params["ci"]["upper"][0]),
        "rv_percent": float(params["rv"][0] * 100),
        "rva_percent": float(params["rva"][0] * 100),
    }

sensitivity_specs = [
    ("mild", 0.01, 0.01, 1.0),
    ("moderate", 0.03, 0.03, 1.0),
    ("strong", 0.06, 0.06, 1.0),
]

sensitivity_rows = []
for scenario, cf_y, cf_d, rho in sensitivity_specs:
    main_irm.sensitivity_analysis(cf_y=cf_y, cf_d=cf_d, rho=rho, level=0.95, null_hypothesis=0.0)
    sensitivity_rows.append(extract_sensitivity_row(main_irm, scenario, cf_y, cf_d, rho))

sensitivity_summary = pd.DataFrame(sensitivity_rows)
save_table(sensitivity_summary, f"{NOTEBOOK_PREFIX}_sensitivity_summary.csv")
display(sensitivity_summary)

# Restore the moderate scenario as the active sensitivity setting for anyone inspecting the model object later.
main_irm.sensitivity_analysis(cf_y=0.03, cf_d=0.03, rho=1.0, level=0.95, null_hypothesis=0.0)
scenario cf_y cf_d rho theta_lower theta theta_upper ci_lower ci_upper rv_percent rva_percent
0 mild 0.01 0.01 1.0 0.440392 0.461497 0.482602 0.377922 0.544962 19.694041 17.213465
1 moderate 0.03 0.03 1.0 0.397532 0.461497 0.525462 0.334932 0.587729 19.694041 17.213465
2 strong 0.06 0.06 1.0 0.331541 0.461497 0.591453 0.268705 0.653612 19.694041 17.213465
<doubleml.irm.irm.DoubleMLIRM at 0x78c3c2bb5fd0>

The sensitivity table asks how much the estimate could move under increasingly strong hidden-confounding scenarios. It does not prove that hidden confounding is absent. It makes the remaining assumption visible.
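
The stress-test grid does not have to be arbitrary. DoubleML can calibrate it against observed covariates: a minimal sketch, assuming the sensitivity_benchmark() method, which refits the model without a named control and reports the cf_y and cf_d values that its omission would have implied.

benchmark = main_irm.sensitivity_benchmark(benchmarking_set=["intent_score"])
display(benchmark)

If omitting a strong observed confounder such as intent_score implies cf values below the "strong" scenario, the grid above is plausibly conservative for hidden confounders of comparable strength.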

Plot Sensitivity Bounds

The sensitivity plot shows how the lower and upper treatment-effect bounds change as the hidden-confounding stress test gets stronger.

fig, ax = plt.subplots(figsize=(10.5, 5.2))
positions = np.arange(len(sensitivity_summary))
ax.plot(positions, sensitivity_summary["theta_lower"], marker="o", color="#2563eb", label="Theta lower")
ax.plot(positions, sensitivity_summary["theta_upper"], marker="o", color="#b45309", label="Theta upper")
ax.axhline(0, color="#111827", linestyle="--", linewidth=1, label="Zero effect")
ax.set_xticks(positions)
ax.set_xticklabels(sensitivity_summary["scenario"])
ax.set_ylabel("Sensitivity-adjusted effect bound")
ax.set_xlabel("Hidden-confounding stress scenario")
ax.set_title("Sensitivity Bounds For The Main IRM Estimate")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_bounds.png", dpi=160, bbox_inches="tight")
plt.show()

The sensitivity bounds communicate the main limitation of observational analysis: the result can be numerically precise and still depend on an untestable adjustment assumption.

Final Evidence Scorecard

The scorecard pulls together the pieces a reviewer should see before trusting the result. It includes design, overlap, nuisance quality, learner robustness, split stability, heterogeneity, and sensitivity.

evidence_scorecard = pd.DataFrame(
    [
        {
            "evidence_area": "Design",
            "finding": "Only pre-treatment controls are used in the main feature set.",
            "status": "documented",
        },
        {
            "evidence_area": "Main estimate",
            "finding": f"Main IRM ATE is {main_row['estimate']:.3f} with 95% CI [{main_row['ci_95_lower']:.3f}, {main_row['ci_95_upper']:.3f}].",
            "status": "estimated",
        },
        {
            "evidence_area": "Overlap",
            "finding": f"Estimated propensity p05={main_row['propensity_p05']:.3f}, p95={main_row['propensity_p95']:.3f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Learner robustness",
            "finding": f"Learner-family estimates range from {learner_comparison['estimate'].min():.3f} to {learner_comparison['estimate'].max():.3f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Sample splitting",
            "finding": f"Repeated split SD is {split_stability_summary.loc[0, 'sd_estimate']:.4f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Heterogeneity",
            "finding": "DR signal summaries suggest larger effects for higher-intent users.",
            "status": "exploratory",
        },
        {
            "evidence_area": "Sensitivity",
            "finding": "Sensitivity bounds are reported for mild, moderate, and strong hidden-confounding scenarios.",
            "status": "stress tested",
        },
        {
            "evidence_area": "Limitations",
            "finding": "Unmeasured confounding, measurement error, and deployment interference remain design risks.",
            "status": "must report",
        },
    ]
)

save_table(evidence_scorecard, f"{NOTEBOOK_PREFIX}_evidence_scorecard.csv")
display(evidence_scorecard)
evidence_area finding status
0 Design Only pre-treatment controls are used in the ma... documented
1 Main estimate Main IRM ATE is 0.461 with 95% CI [0.387, 0.536]. estimated
2 Overlap Estimated propensity p05=0.209, p95=0.732. checked
3 Learner robustness Learner-family estimates range from 0.334 to 0... checked
4 Sample splitting Repeated split SD is 0.0066. checked
5 Heterogeneity DR signal summaries suggest larger effects for... exploratory
6 Sensitivity Sensitivity bounds are reported for mild, mode... stress tested
7 Limitations Unmeasured confounding, measurement error, and... must report

The scorecard is a compact review object. It is the bridge between a technical notebook and a reader who wants to know whether the analysis is credible.

Write The Final Case Study Report

The report template below is written as if this were the final deliverable. It includes the causal question, estimator, main estimate, diagnostics, sensitivity, subgroup findings, and limitations.

report_text = rf"""# End-To-End DoubleML Case Study Report

## Causal Question
Estimate the average effect of receiving a guided onboarding nudge on next-period user value.

## Estimand
The target estimand is the average treatment effect:

$$
\theta_0 = \mathbb{{E}}[Y(1) - Y(0)].
$$

## Identification Assumptions
The analysis assumes conditional exchangeability given the documented pre-treatment controls, overlap between treated and untreated users, stable treatment definition, and no interference across users.

## Main DoubleML Specification
- Estimator: `DoubleMLIRM`
- Score: ATE
- Outcome learner: random forest regressor
- Propensity learner: random forest classifier
- Cross-fitting: 5 folds
- Propensity clipping threshold: 0.02

## Main Estimate
- Estimated ATE: {main_row['estimate']:.4f}
- Standard error: {main_row['std_error']:.4f}
- 95% confidence interval: [{main_row['ci_95_lower']:.4f}, {main_row['ci_95_upper']:.4f}]
- Known true ATE in this teaching simulation: {main_row['true_ate']:.4f}

## Diagnostics
- Estimated propensity p05 / p95: {main_row['propensity_p05']:.4f} / {main_row['propensity_p95']:.4f}
- Outcome RMSE under control model: {main_row['ml_g0_rmse']:.4f}
- Outcome RMSE under treated model: {main_row['ml_g1_rmse']:.4f}
- Propensity log loss: {main_row['ml_m_log_loss']:.4f}
- Sample-split estimate SD: {split_stability_summary.loc[0, 'sd_estimate']:.4f}

## Heterogeneity
The doubly robust signal suggests larger effects for higher-intent users. Treat this as subgroup evidence, not as a fully validated targeting policy.

## Sensitivity
Sensitivity bounds are saved for mild, moderate, and strong hidden-confounding scenarios. These stress tests do not remove hidden-confounding risk; they describe how the estimate would move under specified scenarios.

## Limitations
The analysis remains observational. It can be threatened by missing confounders, bad measurement, outcome leakage, violations of overlap, interference, or changes in the treatment definition.

## Artifact Paths
- Data: `{DATASET_DIR / f'{NOTEBOOK_PREFIX}_end_to_end_case_study_data.csv'}`
- Main estimate: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_main_irm_estimate.csv'}`
- Learner comparison: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_learner_family_comparison.csv'}`
- Segment summary: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_segment_dr_signal_summary.csv'}`
- Sensitivity summary: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_sensitivity_summary.csv'}`
- Evidence scorecard: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_evidence_scorecard.csv'}`
"""

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_end_to_end_case_study_report.md"
report_path.write_text(report_text)
print(f"Wrote report to: {report_path}")
Wrote report to: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/18_end_to_end_case_study_report.md

The report is short on purpose. A final case-study report should not reproduce every notebook cell. It should tell a clear story and point to the artifacts that support that story.

Artifact Manifest

The final manifest lists every major output from this notebook. This makes the case study easy to review outside the notebook interface.

artifact_manifest = pd.DataFrame(
    [
        {"artifact": "Case study dataset", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_end_to_end_case_study_data.csv")},
        {"artifact": "Roadmap", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_case_study_roadmap.csv")},
        {"artifact": "Field dictionary", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_field_dictionary.csv")},
        {"artifact": "Data audit", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_data_audit.csv")},
        {"artifact": "Baseline estimates", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_estimates.csv")},
        {"artifact": "Main IRM estimate", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_main_irm_estimate.csv")},
        {"artifact": "Nuisance diagnostics", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_main_nuisance_diagnostics.csv")},
        {"artifact": "Propensity diagnostics", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_propensity_diagnostics.csv")},
        {"artifact": "Learner family comparison", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_learner_family_comparison.csv")},
        {"artifact": "Segment DR signal summary", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_segment_dr_signal_summary.csv")},
        {"artifact": "Sample split stability", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_split_stability.csv")},
        {"artifact": "Sensitivity summary", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_summary.csv")},
        {"artifact": "Evidence scorecard", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_evidence_scorecard.csv")},
        {"artifact": "Design DAG", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_case_study_design_dag.png")},
        {"artifact": "Main estimate comparison", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_main_estimate_comparison.png")},
        {"artifact": "Propensity overlap", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_propensity_overlap.png")},
        {"artifact": "Learner estimates", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_family_estimates.png")},
        {"artifact": "Segment effects", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_segment_effects.png")},
        {"artifact": "Sample split stability figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sample_split_stability.png")},
        {"artifact": "Sensitivity bounds", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_bounds.png")},
        {"artifact": "Final report", "path": str(REPORT_DIR / f"{NOTEBOOK_PREFIX}_end_to_end_case_study_report.md")},
    ]
)

save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(artifact_manifest)
artifact path
0 Case study dataset /home/apex/Documents/ranking_sys/notebooks/tut...
1 Roadmap /home/apex/Documents/ranking_sys/notebooks/tut...
2 Field dictionary /home/apex/Documents/ranking_sys/notebooks/tut...
3 Data audit /home/apex/Documents/ranking_sys/notebooks/tut...
4 Baseline estimates /home/apex/Documents/ranking_sys/notebooks/tut...
5 Main IRM estimate /home/apex/Documents/ranking_sys/notebooks/tut...
6 Nuisance diagnostics /home/apex/Documents/ranking_sys/notebooks/tut...
7 Propensity diagnostics /home/apex/Documents/ranking_sys/notebooks/tut...
8 Learner family comparison /home/apex/Documents/ranking_sys/notebooks/tut...
9 Segment DR signal summary /home/apex/Documents/ranking_sys/notebooks/tut...
10 Sample split stability /home/apex/Documents/ranking_sys/notebooks/tut...
11 Sensitivity summary /home/apex/Documents/ranking_sys/notebooks/tut...
12 Evidence scorecard /home/apex/Documents/ranking_sys/notebooks/tut...
13 Design DAG /home/apex/Documents/ranking_sys/notebooks/tut...
14 Main estimate comparison /home/apex/Documents/ranking_sys/notebooks/tut...
15 Propensity overlap /home/apex/Documents/ranking_sys/notebooks/tut...
16 Learner estimates /home/apex/Documents/ranking_sys/notebooks/tut...
17 Segment effects /home/apex/Documents/ranking_sys/notebooks/tut...
18 Sample split stability figure /home/apex/Documents/ranking_sys/notebooks/tut...
19 Sensitivity bounds /home/apex/Documents/ranking_sys/notebooks/tut...
20 Final report /home/apex/Documents/ranking_sys/notebooks/tut...

The manifest is the table of contents for the case study outputs. It also makes the notebook easier to grade, review, or convert into a portfolio writeup.

Tutorial Wrap-Up

This case study pulls the DoubleML tutorial series together. The workflow is intentionally repeatable:

  • Define the causal question and target estimand.
  • Document variable timing and exclude invalid controls.
  • Choose the DoubleML class that matches the treatment and data structure.
  • Fit flexible nuisance models with cross-fitting.
  • Report uncertainty, overlap, nuisance diagnostics, learner robustness, and split stability.
  • Treat subgroup findings as exploratory unless the heterogeneity design is formal.
  • Use sensitivity analysis to describe, not erase, hidden-confounding risk.
  • End with a concise report and reviewable artifacts.

The main lesson is that DoubleML is a disciplined estimation framework. The quality of the answer still depends on the quality of the causal design.