EconML Tutorial 13: Estimator Comparison Benchmark

A causal inference library gives us many estimators, but the important practical question is not “which estimator sounds most advanced?” It is: which estimator is best matched to the treatment assignment process, outcome structure, sample size, overlap, and decision goal?

This notebook builds a controlled benchmark where every estimator sees the same data, the same train-test split, and the same evaluation metrics. Because the data are synthetic, we know the true CATE for every held-out row. That lets us compare estimators in a way that real observational data cannot.

We will compare six EconML estimators:

  • LinearDML
  • CausalForestDML
  • DRLearner
  • SLearner
  • TLearner
  • XLearner

The benchmark evaluates more than CATE RMSE. We also examine ATE bias, CATE ranking quality, top-group targeting value, decile calibration, segment behavior, and runtime. The goal is to build a habit of comparing estimators across the decision surface, not just one leaderboard metric.

Learning Goals

By the end of this notebook, you should be able to:

  • design a fair benchmark for heterogeneous treatment-effect estimators;
  • keep the data split, features, and metrics identical across estimators;
  • compare orthogonal estimators with meta-learners;
  • evaluate CATE recovery, ATE bias, ranking quality, and policy-style targeting value;
  • diagnose estimator behavior by CATE decile and by business-relevant segment;
  • explain why the “best” estimator depends on the use case, not only a single score.

Tutorial Flow

The notebook has five parts:

  1. Create a nonlinear binary-treatment teaching dataset with confounded assignment and known CATE.
  2. Run basic diagnostics: treatment balance, propensity overlap, and true effect distribution.
  3. Fit six EconML estimators on the same training data.
  4. Compare held-out performance across recovery, ranking, calibration, targeting, and runtime.
  5. Summarize how to choose an estimator based on the observed benchmark pattern.

Every result-producing code cell is followed by a short discussion so the benchmark reads like an analysis, not just a list of tables.

Setup

This cell imports the notebook dependencies, creates output folders, sets plotting defaults, and suppresses harmless display warnings. All code remains visible so students can rerun and modify the benchmark.

from pathlib import Path
import os
import time
import warnings

# Suppress optional widget warnings that can appear while importing EconML in headless notebook runs.
warnings.filterwarnings("ignore", message="IProgress not found.*")

# Keep Matplotlib cache files in a writable location during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")

import econml
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

from econml.dml import CausalForestDML, LinearDML
from econml.dr import DRLearner
from econml.metalearners import SLearner, TLearner, XLearner

warnings.filterwarnings("ignore", message="X does not have valid feature names.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Not all column names are strings.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Co-variance matrix is underdetermined.*", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.titleweight"] = "bold"
plt.rcParams["axes.labelsize"] = 11


def find_project_root(start=None):
    """Find the repository root from either the repo or a nested notebook folder."""
    start = Path.cwd() if start is None else Path(start)
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").exists() and (candidate / "notebooks").exists():
            return candidate
    return Path.cwd()


PROJECT_ROOT = find_project_root()
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "econml"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"

FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(202613)

print(f"Project root: {PROJECT_ROOT}")
print(f"EconML version: {econml.__version__}")
print(f"Figures will be saved to: {FIGURE_DIR.relative_to(PROJECT_ROOT)}")
print(f"Tables will be saved to: {TABLE_DIR.relative_to(PROJECT_ROOT)}")
Project root: /home/apex/Documents/ranking_sys
EconML version: 0.16.0
Figures will be saved to: notebooks/tutorials/econml/outputs/figures
Tables will be saved to: notebooks/tutorials/econml/outputs/tables

The environment is ready. All artifacts from this benchmark use the 13_ prefix.

Benchmark Design Principles

Before creating data, we define what “fair comparison” means. The estimators differ internally, but the benchmark should hold the external setup fixed.

benchmark_principles = pd.DataFrame(
    [
        {
            "principle": "Same data split",
            "how_this_notebook_applies_it": "Every estimator trains on the same training rows and is evaluated on the same held-out rows.",
            "why_it_matters": "Different splits can make estimator differences look larger or smaller than they are.",
        },
        {
            "principle": "Same feature set",
            "how_this_notebook_applies_it": "All estimators receive the same observed pre-treatment covariates.",
            "why_it_matters": "Feature differences should not be confused with estimator differences.",
        },
        {
            "principle": "Multiple metrics",
            "how_this_notebook_applies_it": "We compare ATE bias, CATE RMSE, correlation, targeting value, calibration, and runtime.",
            "why_it_matters": "One metric rarely captures all decision needs.",
        },
        {
            "principle": "Ground-truth teaching data",
            "how_this_notebook_applies_it": "The synthetic data include true CATE for held-out evaluation.",
            "why_it_matters": "Real data cannot directly validate row-level CATE recovery.",
        },
        {
            "principle": "Decision-aware ranking",
            "how_this_notebook_applies_it": "We evaluate top-group true effects, not only pointwise errors.",
            "why_it_matters": "Many CATE workflows are used for prioritization rather than exact prediction.",
        },
    ]
)

benchmark_principles.to_csv(TABLE_DIR / "13_benchmark_principles.csv", index=False)
display(benchmark_principles)
principle how_this_notebook_applies_it why_it_matters
0 Same data split Every estimator trains on the same training ro... Different splits can make estimator difference...
1 Same feature set All estimators receive the same observed pre-t... Feature differences should not be confused wit...
2 Multiple metrics We compare ATE bias, CATE RMSE, correlation, t... One metric rarely captures all decision needs.
3 Ground-truth teaching data The synthetic data include true CATE for held-... Real data cannot directly validate row-level C...
4 Decision-aware ranking We evaluate top-group true effects, not only p... Many CATE workflows are used for prioritizatio...

These principles keep the benchmark honest. We want estimator differences to come from estimator behavior, not from accidental differences in preprocessing or evaluation.

Estimator Map

This table summarizes the estimator families included in the benchmark. The point is not to declare a universal winner; it is to make each estimator’s modeling strategy explicit.

estimator_map = pd.DataFrame(
    [
        {
            "estimator": "LinearDML",
            "family": "orthogonal DML",
            "core_idea": "Residualize outcome and treatment, then fit a structured final CATE model.",
            "expected_strength": "Good when effect heterogeneity is fairly smooth or well represented by chosen features.",
        },
        {
            "estimator": "CausalForestDML",
            "family": "orthogonal forest",
            "core_idea": "Use forest splits to learn nonlinear heterogeneous effects with orthogonalization.",
            "expected_strength": "Good for nonlinear heterogeneity and segment discovery.",
        },
        {
            "estimator": "DRLearner",
            "family": "doubly robust",
            "core_idea": "Combine outcome and propensity nuisance models into a pseudo-outcome for CATE learning.",
            "expected_strength": "Good when either propensity or outcome modeling is strong enough to stabilize estimates.",
        },
        {
            "estimator": "SLearner",
            "family": "meta-learner",
            "core_idea": "Fit one outcome model using treatment as a feature, then contrast predicted outcomes.",
            "expected_strength": "Simple baseline that can work when treatment interactions are learned well.",
        },
        {
            "estimator": "TLearner",
            "family": "meta-learner",
            "core_idea": "Fit separate outcome models for treated and control rows, then subtract predictions.",
            "expected_strength": "Useful when treated and control outcome surfaces differ strongly and both groups have support.",
        },
        {
            "estimator": "XLearner",
            "family": "meta-learner",
            "core_idea": "Impute treatment effects in each arm and combine them using propensity information.",
            "expected_strength": "Often useful when treatment groups are imbalanced or effects are heterogeneous.",
        },
    ]
)

estimator_map.to_csv(TABLE_DIR / "13_estimator_map.csv", index=False)
display(estimator_map)
estimator family core_idea expected_strength
0 LinearDML orthogonal DML Residualize outcome and treatment, then fit a ... Good when effect heterogeneity is fairly smoot...
1 CausalForestDML orthogonal forest Use forest splits to learn nonlinear heterogen... Good for nonlinear heterogeneity and segment d...
2 DRLearner doubly robust Combine outcome and propensity nuisance models... Good when either propensity or outcome modelin...
3 SLearner meta-learner Fit one outcome model using treatment as a fea... Simple baseline that can work when treatment i...
4 TLearner meta-learner Fit separate outcome models for treated and co... Useful when treated and control outcome surfac...
5 XLearner meta-learner Impute treatment effects in each arm and combi... Often useful when treatment groups are imbalan...

The estimators differ in how they use nuisance models, treatment assignment, and outcome surfaces. That is why the same dataset can favor different methods under different metrics.

Teaching Data Design

The synthetic data include nonlinear treatment effects, confounded binary treatment assignment, and realistic overlap. Treatment assignment depends on observed covariates, so naive treated-control comparisons will be biased.

def true_cate_function(frame):
    """Ground-truth CATE used for synthetic benchmark evaluation."""
    return (
        0.38
        + 0.34 * frame["baseline_need"]
        + 0.22 * np.tanh(frame["content_affinity"])
        - 0.24 * np.maximum(frame["friction_score"], 0)
        - 0.18 * frame["price_sensitivity"]
        - 0.12 * frame["region_risk"]
        + 0.18 * frame["high_need_segment"] * (frame["content_affinity"] > 0).astype(float)
        + 0.12 * np.sin(frame["baseline_need"] + frame["novelty_preference"])
        + 0.08 * frame["device_stability"] * frame["content_affinity"]
    )


n = 4_000
baseline_need = rng.normal(0, 1, size=n)
prior_engagement = rng.normal(0, 1, size=n)
friction_score = rng.normal(0, 1, size=n)
content_affinity = rng.normal(0, 1, size=n)
novelty_preference = rng.normal(0, 1, size=n)
price_sensitivity = rng.beta(2.2, 3.0, size=n)
account_tenure = rng.gamma(shape=3.0, scale=16.0, size=n)
region_risk = rng.binomial(1, 0.28, size=n)
device_stability = rng.beta(4.0, 2.0, size=n)
high_need_segment = (baseline_need > 0.55).astype(int)
low_friction_segment = (friction_score < -0.25).astype(int)

features = pd.DataFrame(
    {
        "baseline_need": baseline_need,
        "prior_engagement": prior_engagement,
        "friction_score": friction_score,
        "content_affinity": content_affinity,
        "novelty_preference": novelty_preference,
        "price_sensitivity": price_sensitivity,
        "account_tenure": account_tenure,
        "region_risk": region_risk,
        "device_stability": device_stability,
        "high_need_segment": high_need_segment,
        "low_friction_segment": low_friction_segment,
    }
)

true_cate = true_cate_function(features)
propensity = expit(
    -0.15
    + 0.58 * baseline_need
    + 0.35 * prior_engagement
    - 0.38 * friction_score
    + 0.28 * content_affinity
    - 0.40 * price_sensitivity
    + 0.20 * novelty_preference
    + 0.22 * region_risk
)
propensity = np.clip(propensity, 0.04, 0.96)
treatment = rng.binomial(1, propensity, size=n)

baseline_outcome = (
    1.05
    + 0.55 * baseline_need
    + 0.42 * prior_engagement
    - 0.32 * friction_score
    + 0.26 * content_affinity
    + 0.18 * novelty_preference
    - 0.20 * price_sensitivity
    + 0.10 * np.log1p(account_tenure)
    - 0.18 * region_risk
    + 0.12 * device_stability
    + 0.10 * np.sin(content_affinity)
)
noise = rng.normal(0, 0.70 + 0.15 * np.maximum(friction_score, 0), size=n)
outcome = baseline_outcome + treatment * true_cate + noise

benchmark_df = features.assign(
    treatment=treatment,
    outcome=outcome,
    true_propensity=propensity,
    true_cate=true_cate,
    baseline_outcome=baseline_outcome,
)

benchmark_df.to_csv(TABLE_DIR / "13_benchmark_teaching_data.csv", index=False)
benchmark_df.head()
baseline_need prior_engagement friction_score content_affinity novelty_preference price_sensitivity account_tenure region_risk device_stability high_need_segment low_friction_segment treatment outcome true_propensity true_cate baseline_outcome
0 1.032714 -0.012577 2.409001 1.352976 1.053867 0.557824 96.371016 0 0.776164 1 0 0 2.631807 0.473931 0.613399 1.920365
1 2.167932 -0.961431 0.350214 -0.587670 0.679030 0.659541 27.196253 0 0.704150 1 0 0 1.827384 0.585468 0.799862 1.926991
2 -0.605126 -0.138752 -0.212956 -0.792624 1.518319 0.690447 63.876623 0 0.787881 0 0 0 0.743609 0.340052 -0.050184 1.096751
3 1.279988 -0.598436 1.487410 -0.138919 1.809341 0.377421 73.808594 0 0.262653 1 0 0 0.058752 0.497432 0.363265 1.689921
4 0.802231 0.588796 0.752663 -0.874492 -0.266609 0.649335 38.721834 1 0.973615 1 0 1 1.084485 0.474384 0.073568 1.320747

The first rows show the complete teaching dataset. The models will use observed covariates, treatment, and outcome; the true CATE and true propensity are kept only for diagnostics.

Field Dictionary

This table describes the fields used in the benchmark. Clear field roles help students separate observed inputs from teaching-only truth.

field_dictionary = pd.DataFrame(
    [
        ("baseline_need", "Observed covariate", "Pre-treatment demand or need signal."),
        ("prior_engagement", "Observed covariate", "Pre-treatment engagement signal."),
        ("friction_score", "Observed covariate", "Higher values mean more friction."),
        ("content_affinity", "Observed covariate", "Match between row and content or offer."),
        ("novelty_preference", "Observed covariate", "Preference for exploratory or new experiences."),
        ("price_sensitivity", "Observed covariate", "Sensitivity to cost, effort, or inconvenience."),
        ("account_tenure", "Observed covariate", "Age of the account or relationship in weeks."),
        ("region_risk", "Observed covariate", "Binary marker for lower baseline outcome regions."),
        ("device_stability", "Observed covariate", "Technical stability or reliability signal."),
        ("high_need_segment", "Observed covariate", "Binary segment derived from baseline need."),
        ("low_friction_segment", "Observed covariate", "Binary segment derived from friction score."),
        ("treatment", "Treatment", "Binary intervention assignment."),
        ("outcome", "Outcome", "Observed post-treatment outcome."),
        ("true_propensity", "Teaching-only truth", "Known treatment probability from the synthetic assignment process."),
        ("true_cate", "Teaching-only truth", "Known conditional treatment effect for each row."),
        ("baseline_outcome", "Teaching-only truth", "Potential baseline outcome component before treatment effect and noise."),
    ],
    columns=["field", "role", "description"],
)

field_dictionary.to_csv(TABLE_DIR / "13_field_dictionary.csv", index=False)
display(field_dictionary)
field role description
0 baseline_need Observed covariate Pre-treatment demand or need signal.
1 prior_engagement Observed covariate Pre-treatment engagement signal.
2 friction_score Observed covariate Higher values mean more friction.
3 content_affinity Observed covariate Match between row and content or offer.
4 novelty_preference Observed covariate Preference for exploratory or new experiences.
5 price_sensitivity Observed covariate Sensitivity to cost, effort, or inconvenience.
6 account_tenure Observed covariate Age of the account or relationship in weeks.
7 region_risk Observed covariate Binary marker for lower baseline outcome regions.
8 device_stability Observed covariate Technical stability or reliability signal.
9 high_need_segment Observed covariate Binary segment derived from baseline need.
10 low_friction_segment Observed covariate Binary segment derived from friction score.
11 treatment Treatment Binary intervention assignment.
12 outcome Outcome Observed post-treatment outcome.
13 true_propensity Teaching-only truth Known treatment probability from the synthetic...
14 true_cate Teaching-only truth Known conditional treatment effect for each row.
15 baseline_outcome Teaching-only truth Potential baseline outcome component before tr...

The teaching-only fields make benchmarking possible. Real applications usually cannot compute CATE RMSE because the true row-level treatment effect is unknown.

Basic Data Summary

This cell summarizes the treatment rate, outcome distribution, true effect distribution, and overlap range. These checks establish the difficulty of the benchmark before fitting any estimator.

basic_summary = pd.DataFrame(
    {
        "metric": [
            "rows",
            "columns",
            "treatment_rate",
            "outcome_mean",
            "outcome_std",
            "true_ate",
            "true_cate_std",
            "share_positive_true_cate",
            "min_true_propensity",
            "max_true_propensity",
        ],
        "value": [
            len(benchmark_df),
            benchmark_df.shape[1],
            benchmark_df["treatment"].mean(),
            benchmark_df["outcome"].mean(),
            benchmark_df["outcome"].std(),
            benchmark_df["true_cate"].mean(),
            benchmark_df["true_cate"].std(),
            (benchmark_df["true_cate"] > 0).mean(),
            benchmark_df["true_propensity"].min(),
            benchmark_df["true_propensity"].max(),
        ],
    }
)

basic_summary.to_csv(TABLE_DIR / "13_basic_summary.csv", index=False)
display(basic_summary)
metric value
0 rows 4000.000000
1 columns 16.000000
2 treatment_rate 0.439750
3 outcome_mean 1.541152
4 outcome_std 1.368809
5 true_ate 0.198488
6 true_cate_std 0.507004
7 share_positive_true_cate 0.640000
8 min_true_propensity 0.040000
9 max_true_propensity 0.944819

The true CATE has meaningful spread, so estimator ranking and heterogeneity recovery matter. This is not just an average-effect problem.

True CATE Distribution

The true CATE distribution gives a visual sense of the benchmark target. Estimators should recover both the level and ranking of this distribution on held-out data.

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(benchmark_df["true_cate"], bins=50, kde=True, color="#2563eb", ax=ax)
ax.axvline(benchmark_df["true_cate"].mean(), color="#111827", linestyle="--", linewidth=1.4, label="True ATE")
ax.axvline(0, color="#b91c1c", linestyle=":", linewidth=1.4, label="Zero effect")
ax.set_title("True Treatment-Effect Heterogeneity")
ax.set_xlabel("True CATE")
ax.set_ylabel("Number of rows")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_true_cate_distribution.png", dpi=160, bbox_inches="tight")
plt.show()

The distribution contains low-benefit and high-benefit regions. A good estimator should not only estimate the mean, but also identify the right tail well enough for targeting.

Naive Treated-Control Difference

Before fitting causal estimators, we compute the raw treated-control difference. This is not a causal estimate because treatment assignment is confounded.

raw_group_summary = (
    benchmark_df.groupby("treatment")
    .agg(
        rows=("outcome", "size"),
        outcome_mean=("outcome", "mean"),
        true_cate_mean=("true_cate", "mean"),
        true_propensity_mean=("true_propensity", "mean"),
        high_need_rate=("high_need_segment", "mean"),
        low_friction_rate=("low_friction_segment", "mean"),
    )
    .reset_index()
)
treated_outcome_mean = raw_group_summary.loc[raw_group_summary["treatment"] == 1, "outcome_mean"].iloc[0]
control_outcome_mean = raw_group_summary.loc[raw_group_summary["treatment"] == 0, "outcome_mean"].iloc[0]
raw_difference = treated_outcome_mean - control_outcome_mean
raw_difference_vs_truth = pd.DataFrame(
    {
        "quantity": ["raw_treated_minus_control", "true_ate", "raw_minus_true_ate"],
        "value": [raw_difference, benchmark_df["true_cate"].mean(), raw_difference - benchmark_df["true_cate"].mean()],
    }
)

raw_group_summary.to_csv(TABLE_DIR / "13_raw_group_summary.csv", index=False)
raw_difference_vs_truth.to_csv(TABLE_DIR / "13_raw_difference_vs_truth.csv", index=False)
display(raw_group_summary)
display(raw_difference_vs_truth)
treatment rows outcome_mean true_cate_mean true_propensity_mean high_need_rate low_friction_rate
0 0 2241 1.105038 0.065412 0.388476 0.211959 0.352521
1 1 1759 2.096770 0.368029 0.524493 0.387152 0.475270
quantity value
0 raw_treated_minus_control 0.991732
1 true_ate 0.198488
2 raw_minus_true_ate 0.793244

The raw difference mixes true treatment effect with selection into treatment. The benchmark therefore needs estimators that adjust for observed confounding.

Covariate Balance

This cell computes standardized mean differences between treated and control rows. Large imbalance tells us where adjustment is needed.

feature_cols = [
    "baseline_need",
    "prior_engagement",
    "friction_score",
    "content_affinity",
    "novelty_preference",
    "price_sensitivity",
    "account_tenure",
    "region_risk",
    "device_stability",
    "high_need_segment",
    "low_friction_segment",
]

balance_rows = []
for feature in feature_cols:
    treated_values = benchmark_df.loc[benchmark_df["treatment"] == 1, feature]
    control_values = benchmark_df.loc[benchmark_df["treatment"] == 0, feature]
    pooled_sd = np.sqrt((treated_values.var(ddof=1) + control_values.var(ddof=1)) / 2)
    smd = (treated_values.mean() - control_values.mean()) / pooled_sd
    balance_rows.append(
        {
            "feature": feature,
            "treated_mean": treated_values.mean(),
            "control_mean": control_values.mean(),
            "standardized_mean_difference": smd,
            "absolute_smd": abs(smd),
        }
    )

balance_table = pd.DataFrame(balance_rows).sort_values("absolute_smd", ascending=False)
balance_table.to_csv(TABLE_DIR / "13_covariate_balance.csv", index=False)
display(balance_table)
feature treated_mean control_mean standardized_mean_difference absolute_smd
0 baseline_need 0.293826 -0.243031 0.538938 0.538938
9 high_need_segment 0.387152 0.211959 0.389555 0.389555
2 friction_score -0.193799 0.120983 -0.322196 0.322196
1 prior_engagement 0.195121 -0.105398 0.301412 0.301412
10 low_friction_segment 0.475270 0.352521 0.251115 0.251115
3 content_affinity 0.124215 -0.091619 0.218096 0.218096
4 novelty_preference 0.074636 -0.073350 0.150108 0.150108
7 region_risk 0.306424 0.277555 0.063510 0.063510
5 price_sensitivity 0.416023 0.422035 -0.030084 0.030084
6 account_tenure 48.985911 48.214979 0.028049 0.028049
8 device_stability 0.668324 0.668905 -0.003238 0.003238

The imbalance pattern confirms that treatment assignment is observational. This creates a meaningful benchmark for methods that model propensity, outcomes, or both.

Balance Plot

The plot makes the magnitude and direction of imbalance easier to scan than the table alone.

fig, ax = plt.subplots(figsize=(10, 6))
plot_balance = balance_table.sort_values("standardized_mean_difference")
sns.barplot(data=plot_balance, x="standardized_mean_difference", y="feature", color="#0284c7", ax=ax)
ax.axvline(0, color="#111827", linewidth=1)
ax.axvline(0.10, color="#b91c1c", linestyle="--", linewidth=1)
ax.axvline(-0.10, color="#b91c1c", linestyle="--", linewidth=1)
ax.set_title("Treatment-Control Covariate Imbalance")
ax.set_xlabel("Standardized mean difference")
ax.set_ylabel("Feature")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_covariate_balance.png", dpi=160, bbox_inches="tight")
plt.show()

Several features cross the practical imbalance threshold. The estimators will need to account for these differences to recover the causal effect.

Propensity Overlap

Overlap is essential for comparing treated and control outcomes within similar covariate regions. Here we can inspect the true synthetic propensity.

fig, ax = plt.subplots(figsize=(10, 5))
sns.kdeplot(
    data=benchmark_df,
    x="true_propensity",
    hue="treatment",
    common_norm=False,
    fill=True,
    alpha=0.28,
    linewidth=1.4,
    ax=ax,
)
ax.set_title("True Propensity Overlap")
ax.set_xlabel("True propensity score")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_true_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The treated and control distributions overlap, but treatment is clearly not random. That is the sweet spot for a useful teaching benchmark: confounding exists, but overlap is not catastrophically bad.
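
To put a number on the plot, the optional sketch below counts rows in near-deterministic assignment regions of the true propensity. The 0.10 and 0.90 cutoffs are illustrative, not defaults from any library; in real data the same check would use an estimated propensity score.

# Optional overlap check with illustrative cutoffs (0.10 and 0.90).
# Rows in these regions have almost no comparable counterparts in the other arm.
overlap_check = (
    benchmark_df.assign(
        extreme_low=benchmark_df["true_propensity"] < 0.10,
        extreme_high=benchmark_df["true_propensity"] > 0.90,
    )
    .groupby("treatment")
    .agg(
        rows=("true_propensity", "size"),
        share_propensity_below_010=("extreme_low", "mean"),
        share_propensity_above_090=("extreme_high", "mean"),
    )
    .reset_index()
)
display(overlap_check)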

Train-Test Split

We split the data once and reuse the split for every estimator. Stratifying on treatment keeps treatment rates similar across training and test data.

train_df, test_df = train_test_split(
    benchmark_df,
    test_size=0.35,
    random_state=202613,
    stratify=benchmark_df["treatment"],
)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

split_summary = pd.DataFrame(
    {
        "split": ["train", "test"],
        "rows": [len(train_df), len(test_df)],
        "treatment_rate": [train_df["treatment"].mean(), test_df["treatment"].mean()],
        "outcome_mean": [train_df["outcome"].mean(), test_df["outcome"].mean()],
        "true_ate": [train_df["true_cate"].mean(), test_df["true_cate"].mean()],
        "true_cate_std": [train_df["true_cate"].std(), test_df["true_cate"].std()],
    }
)

split_summary.to_csv(TABLE_DIR / "13_train_test_split_summary.csv", index=False)
display(split_summary)
split rows treatment_rate outcome_mean true_ate true_cate_std
0 train 2600 0.439615 1.536659 0.198415 0.505453
1 test 1400 0.440000 1.549496 0.198624 0.510053

The train and test splits are similar on treatment rate and true effect distribution. That makes held-out estimator comparisons more stable.

Model Matrices

This cell creates the model inputs shared by every estimator. The same feature matrix is used across the benchmark.

X_train = train_df[feature_cols].copy()
X_test = test_df[feature_cols].copy()
y_train = train_df["outcome"].to_numpy()
y_test = test_df["outcome"].to_numpy()
t_train = train_df["treatment"].to_numpy()
t_test = test_df["treatment"].to_numpy()
true_tau_test = test_df["true_cate"].to_numpy()

matrix_summary = pd.DataFrame(
    {
        "object": ["X_train", "X_test", "y_train", "t_train", "true_tau_test"],
        "shape_or_length": [X_train.shape, X_test.shape, len(y_train), len(t_train), len(true_tau_test)],
        "description": [
            "Shared covariate matrix for training.",
            "Shared covariate matrix for held-out evaluation.",
            "Training outcomes.",
            "Training treatment indicators.",
            "Teaching-only true CATE values for held-out evaluation.",
        ],
    }
)

matrix_summary.to_csv(TABLE_DIR / "13_model_matrix_summary.csv", index=False)
display(matrix_summary)
object shape_or_length description
0 X_train (2600, 11) Shared covariate matrix for training.
1 X_test (1400, 11) Shared covariate matrix for held-out evaluation.
2 y_train 2600 Training outcomes.
3 t_train 2600 Training treatment indicators.
4 true_tau_test 1400 Teaching-only true CATE values for held-out ev...

The shared matrices are the foundation of a fair comparison. From here on, estimator differences are not due to different features or splits.

Nuisance Diagnostics

Before comparing CATE estimators, we check whether basic models can predict treatment and outcome. Strong nuisance signal is expected because treatment is confounded and the outcome is structured.

propensity_model = RandomForestClassifier(n_estimators=220, min_samples_leaf=25, random_state=202613, n_jobs=-1)
propensity_model.fit(X_train, t_train)
propensity_pred = np.clip(propensity_model.predict_proba(X_test)[:, 1], 1e-4, 1 - 1e-4)

outcome_model = RandomForestRegressor(n_estimators=220, min_samples_leaf=20, random_state=202614, n_jobs=-1)
outcome_model.fit(X_train.assign(treatment=t_train), y_train)
outcome_pred = outcome_model.predict(X_test.assign(treatment=t_test))

nuisance_diagnostics = pd.DataFrame(
    [
        {
            "model": "propensity_model",
            "metric_1": "roc_auc",
            "value_1": roc_auc_score(t_test, propensity_pred),
            "metric_2": "log_loss",
            "value_2": log_loss(t_test, propensity_pred),
        },
        {
            "model": "outcome_model",
            "metric_1": "rmse",
            "value_1": np.sqrt(mean_squared_error(y_test, outcome_pred)),
            "metric_2": "mae",
            "value_2": mean_absolute_error(y_test, outcome_pred),
        },
    ]
)

nuisance_diagnostics.to_csv(TABLE_DIR / "13_nuisance_diagnostics.csv", index=False)
display(nuisance_diagnostics)
model metric_1 value_1 metric_2 value_2
0 propensity_model roc_auc 0.684963 log_loss 0.634215
1 outcome_model rmse 0.873186 mae 0.691104

The propensity model can predict treatment assignment, so confounding is present. The outcome model also has signal, which should help estimators that rely on outcome nuisance models.

Define Estimators

This cell creates all estimators in the benchmark. The hyperparameters are intentionally moderate so the notebook runs quickly while still showing meaningful differences.

estimators = {
    "LinearDML": LinearDML(
        model_y=RandomForestRegressor(n_estimators=160, min_samples_leaf=20, random_state=202615, n_jobs=-1),
        model_t=RandomForestClassifier(n_estimators=160, min_samples_leaf=25, random_state=202616, n_jobs=-1),
        discrete_treatment=True,
        featurizer=PolynomialFeatures(degree=2, include_bias=False),
        cv=3,
        random_state=202617,
    ),
    "CausalForestDML": CausalForestDML(
        model_y=RandomForestRegressor(n_estimators=160, min_samples_leaf=20, random_state=202618, n_jobs=-1),
        model_t=RandomForestClassifier(n_estimators=160, min_samples_leaf=25, random_state=202619, n_jobs=-1),
        discrete_treatment=True,
        n_estimators=220,
        min_samples_leaf=18,
        max_samples=0.45,
        cv=3,
        random_state=202620,
        inference=True,
    ),
    "DRLearner": DRLearner(
        model_propensity=LogisticRegression(max_iter=1_000),
        model_regression=RandomForestRegressor(n_estimators=180, min_samples_leaf=20, random_state=202621, n_jobs=-1),
        model_final=RandomForestRegressor(n_estimators=220, min_samples_leaf=18, random_state=202622, n_jobs=-1),
        cv=3,
        random_state=202623,
    ),
    "SLearner": SLearner(
        overall_model=RandomForestRegressor(n_estimators=220, min_samples_leaf=18, random_state=202624, n_jobs=-1),
    ),
    "TLearner": TLearner(
        models=RandomForestRegressor(n_estimators=220, min_samples_leaf=18, random_state=202625, n_jobs=-1),
    ),
    "XLearner": XLearner(
        models=RandomForestRegressor(n_estimators=220, min_samples_leaf=18, random_state=202626, n_jobs=-1),
        cate_models=RandomForestRegressor(n_estimators=220, min_samples_leaf=18, random_state=202627, n_jobs=-1),
        propensity_model=LogisticRegression(max_iter=1_000),
    ),
}

estimator_setup = pd.DataFrame(
    [{"estimator": name, "python_class": estimator.__class__.__name__} for name, estimator in estimators.items()]
)
estimator_setup.to_csv(TABLE_DIR / "13_estimator_setup.csv", index=False)
display(estimator_setup)
estimator python_class
0 LinearDML LinearDML
1 CausalForestDML CausalForestDML
2 DRLearner DRLearner
3 SLearner SLearner
4 TLearner TLearner
5 XLearner XLearner

All estimators are now configured. Some use orthogonalization, some use doubly robust pseudo-outcomes, and some use meta-learner outcome-model contrasts.

Fit All Estimators

This cell fits every estimator and records runtime. The benchmark keeps fitting code centralized so every estimator is evaluated consistently.

fitted_estimators = {}
fit_records = []

for name, estimator in estimators.items():
    start_time = time.perf_counter()
    estimator.fit(y_train, t_train, X=X_train)
    elapsed = time.perf_counter() - start_time
    fitted_estimators[name] = estimator
    fit_records.append({"estimator": name, "fit_seconds": elapsed})

fit_runtime = pd.DataFrame(fit_records).sort_values("fit_seconds")
fit_runtime.to_csv(TABLE_DIR / "13_fit_runtime.csv", index=False)
display(fit_runtime)
estimator fit_seconds
3 SLearner 0.383530
4 TLearner 0.729670
5 XLearner 1.815952
2 DRLearner 2.173814
0 LinearDML 3.017651
1 CausalForestDML 3.261215

Runtime is not the main scientific metric, but it matters in practice. A slower estimator may be worth it if it improves the decision metric that matters.

Predict Held-Out CATEs

After fitting, each estimator predicts treatment effects on the same held-out rows. We store all predictions in one table so downstream diagnostics are easy to compare.

prediction_records = []
effect_predictions = pd.DataFrame({"true_cate": true_tau_test})

for name, estimator in fitted_estimators.items():
    start_time = time.perf_counter()
    estimated_effect = np.ravel(estimator.effect(X_test, T0=0, T1=1))
    predict_seconds = time.perf_counter() - start_time
    effect_predictions[name] = estimated_effect
    prediction_records.append({"estimator": name, "predict_seconds": predict_seconds})

prediction_runtime = pd.DataFrame(prediction_records).sort_values("predict_seconds")
effect_predictions.to_csv(TABLE_DIR / "13_heldout_cate_predictions.csv", index=False)
prediction_runtime.to_csv(TABLE_DIR / "13_prediction_runtime.csv", index=False)
display(prediction_runtime)
display(effect_predictions.head())
estimator predict_seconds
0 LinearDML 0.002862
1 CausalForestDML 0.076335
2 DRLearner 0.097243
3 SLearner 0.103912
4 TLearner 0.169307
5 XLearner 0.191745
true_cate LinearDML CausalForestDML DRLearner SLearner TLearner XLearner
0 0.376500 0.454522 0.404979 0.504012 0.569757 0.788655 0.506183
1 0.381832 0.066483 0.382831 0.257904 0.067475 0.324037 0.453422
2 0.088619 0.084981 0.281727 -0.087010 0.505844 0.403956 0.215171
3 -0.587649 0.133417 0.089354 0.075430 0.000000 0.330138 0.052100
4 0.436237 0.765672 0.287974 0.207394 0.150250 0.400706 0.139964

The prediction table is the central benchmark artifact. Every later comparison is computed from this shared held-out table.

Main Benchmark Metrics

This cell computes a shared metric set for every estimator: average-effect bias, CATE error, CATE correlation, positive-effect rate, and top-20-percent targeting value.

def summarize_estimator(name, estimate):
    """Compute common held-out benchmark metrics for one estimator."""
    top_n = int(np.floor(0.20 * len(estimate)))
    top_idx = pd.Series(estimate).nlargest(top_n).index
    return {
        "estimator": name,
        "true_ate": true_tau_test.mean(),
        "estimated_ate": estimate.mean(),
        "ate_bias": estimate.mean() - true_tau_test.mean(),
        "absolute_ate_bias": abs(estimate.mean() - true_tau_test.mean()),
        "cate_rmse": np.sqrt(mean_squared_error(true_tau_test, estimate)),
        "cate_mae": mean_absolute_error(true_tau_test, estimate),
        "cate_correlation": np.corrcoef(true_tau_test, estimate)[0, 1],
        "share_estimated_positive": np.mean(estimate > 0),
        "top_20_true_cate_mean": true_tau_test[top_idx].mean(),
        "top_20_estimated_cate_mean": estimate[top_idx].mean(),
        "top_20_true_negative_share": np.mean(true_tau_test[top_idx] < 0),
    }

benchmark_metrics = pd.DataFrame(
    [summarize_estimator(name, effect_predictions[name].to_numpy()) for name in estimators.keys()]
).merge(fit_runtime, on="estimator", how="left").merge(prediction_runtime, on="estimator", how="left")

benchmark_metrics = benchmark_metrics.sort_values("cate_rmse").reset_index(drop=True)
benchmark_metrics.to_csv(TABLE_DIR / "13_benchmark_metrics.csv", index=False)
display(benchmark_metrics)
estimator true_ate estimated_ate ate_bias absolute_ate_bias cate_rmse cate_mae cate_correlation share_estimated_positive top_20_true_cate_mean top_20_estimated_cate_mean top_20_true_negative_share fit_seconds predict_seconds
0 XLearner 0.198624 0.273597 0.074973 0.074973 0.274643 0.211071 0.855272 0.694286 0.850097 0.907867 0.000000 1.815952 0.191745
1 SLearner 0.198624 0.308376 0.109752 0.109752 0.301581 0.229460 0.836177 0.793571 0.892449 1.008308 0.000000 0.383530 0.103912
2 DRLearner 0.198624 0.227219 0.028595 0.028595 0.314208 0.244124 0.813792 0.662857 0.856903 0.984641 0.000000 2.173814 0.097243
3 CausalForestDML 0.198624 0.301695 0.103071 0.103071 0.318427 0.245153 0.913833 0.893571 0.895554 0.672556 0.000000 3.261215 0.076335
4 LinearDML 0.198624 0.290684 0.092060 0.092060 0.393217 0.288085 0.761342 0.665714 0.798281 1.154401 0.021429 3.017651 0.002862
5 TLearner 0.198624 0.408943 0.210319 0.210319 0.435578 0.331652 0.719571 0.773571 0.846643 1.153384 0.007143 0.729670 0.169307

The table is the main leaderboard, but it should not be read as a single universal ranking. CATE RMSE, ATE bias, and targeting value can point to different choices.

Metric Ranking Table

A compact rank table helps compare estimators across several objectives. Lower rank is better for each metric.

rank_table = benchmark_metrics[["estimator"]].copy()
rank_table["rmse_rank"] = benchmark_metrics["cate_rmse"].rank(method="min", ascending=True).astype(int)
rank_table["ate_bias_rank"] = benchmark_metrics["absolute_ate_bias"].rank(method="min", ascending=True).astype(int)
rank_table["correlation_rank"] = benchmark_metrics["cate_correlation"].rank(method="min", ascending=False).astype(int)
rank_table["targeting_rank"] = benchmark_metrics["top_20_true_cate_mean"].rank(method="min", ascending=False).astype(int)
rank_table["runtime_rank"] = benchmark_metrics["fit_seconds"].rank(method="min", ascending=True).astype(int)
rank_table["average_rank"] = rank_table[["rmse_rank", "ate_bias_rank", "correlation_rank", "targeting_rank", "runtime_rank"]].mean(axis=1)
rank_table = rank_table.sort_values("average_rank").reset_index(drop=True)

rank_table.to_csv(TABLE_DIR / "13_metric_rank_table.csv", index=False)
display(rank_table)
estimator rmse_rank ate_bias_rank correlation_rank targeting_rank runtime_rank average_rank
0 XLearner 1 2 2 4 3 2.4
1 SLearner 2 5 3 2 1 2.6
2 DRLearner 3 1 4 3 4 3.0
3 CausalForestDML 4 4 1 1 6 3.2
4 LinearDML 5 3 5 6 5 4.8
5 TLearner 6 6 6 5 2 5.0

The average rank is a simple summary, not a decision rule. A real project should weight metrics based on the use case: estimation accuracy, targeting, or runtime.
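
As a concrete sketch of such weighting, the cell below computes a weighted average rank. The weights are hypothetical and exist only to show the mechanics; they are not a recommendation.

# Hypothetical weights that emphasize targeting over runtime; adjust to the decision context.
metric_weights = {
    "rmse_rank": 0.2,
    "ate_bias_rank": 0.2,
    "correlation_rank": 0.2,
    "targeting_rank": 0.3,
    "runtime_rank": 0.1,
}
weighted_rank = rank_table[["estimator"]].copy()
weighted_rank["weighted_average_rank"] = sum(
    weight * rank_table[column] for column, weight in metric_weights.items()
)
display(weighted_rank.sort_values("weighted_average_rank").reset_index(drop=True))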

CATE RMSE And ATE Bias Plot

This plot shows two different ways an estimator can be good or bad: pointwise CATE recovery and average-effect bias.

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plot_metrics = benchmark_metrics.sort_values("cate_rmse")

sns.barplot(data=plot_metrics, x="cate_rmse", y="estimator", color="#2563eb", ax=axes[0])
axes[0].set_title("Held-Out CATE RMSE")
axes[0].set_xlabel("RMSE versus true CATE")
axes[0].set_ylabel("Estimator")

bias_plot = benchmark_metrics.sort_values("ate_bias")
sns.barplot(data=bias_plot, x="ate_bias", y="estimator", color="#b45309", ax=axes[1])
axes[1].axvline(0, color="#111827", linestyle=":", linewidth=1)
axes[1].set_title("ATE Bias")
axes[1].set_xlabel("Estimated ATE minus true ATE")
axes[1].set_ylabel("")

plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_rmse_and_ate_bias.png", dpi=160, bbox_inches="tight")
plt.show()

The side-by-side view is useful because average-effect accuracy and row-level recovery are different goals. An estimator can be close on the ATE while still ranking rows poorly.
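
A tiny illustration of that gap: a constant prediction equal to the true ATE has zero ATE bias but carries no ranking information, and its CATE RMSE is roughly the standard deviation of the true effects.

# Constant predictor: perfect on the ATE, useless for ranking.
constant_estimate = np.full_like(true_tau_test, true_tau_test.mean())
print(f"ATE bias of constant predictor: {constant_estimate.mean() - true_tau_test.mean():.4f}")
print(f"CATE RMSE of constant predictor: {np.sqrt(mean_squared_error(true_tau_test, constant_estimate)):.4f}")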

CATE Recovery Scatterplots

Scatterplots compare true and estimated CATE values directly. The diagonal line marks perfect recovery.

scatter_data = effect_predictions.melt(
    id_vars="true_cate",
    value_vars=list(estimators.keys()),
    var_name="estimator",
    value_name="estimated_cate",
)

g = sns.FacetGrid(scatter_data, col="estimator", col_wrap=3, height=4, sharex=True, sharey=True)
g.map_dataframe(sns.scatterplot, x="true_cate", y="estimated_cate", alpha=0.45, s=20, color="#2563eb")

for ax in g.axes.flat:
    lims = [
        min(scatter_data["true_cate"].min(), scatter_data["estimated_cate"].min()),
        max(scatter_data["true_cate"].max(), scatter_data["estimated_cate"].max()),
    ]
    ax.plot(lims, lims, color="#111827", linestyle="--", linewidth=1)
    ax.axhline(0, color="#64748b", linestyle=":", linewidth=1)
    ax.axvline(0, color="#64748b", linestyle=":", linewidth=1)

g.set_axis_labels("True CATE", "Estimated CATE")
g.fig.suptitle("Held-Out CATE Recovery By Estimator", y=1.03)
g.fig.savefig(FIGURE_DIR / "13_cate_recovery_scatterplots.png", dpi=160, bbox_inches="tight")
plt.show()

The scatterplots show shape differences that a table can hide. Some estimators may compress the range, while others may capture ranking but have level bias.
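
One way to quantify those shapes is to regress each estimator's predictions on the true CATE: a slope below one indicates range compression, and a nonzero intercept indicates a level shift. This is a descriptive sketch, not a formal calibration test.

# Fit a line estimated_cate ~ slope * true_cate + intercept for each estimator.
recovery_shape_rows = []
for name in estimators.keys():
    slope, intercept = np.polyfit(true_tau_test, effect_predictions[name].to_numpy(), deg=1)
    recovery_shape_rows.append(
        {"estimator": name, "slope_vs_true_cate": slope, "intercept_vs_true_cate": intercept}
    )

recovery_shape = pd.DataFrame(recovery_shape_rows).sort_values("slope_vs_true_cate", ascending=False)
display(recovery_shape)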

Ranking And Targeting Comparison

Many applied CATE workflows use estimates for ranking. This cell compares the true CATE among rows selected by each estimator’s top 20 percent.

top_n = int(np.floor(0.20 * len(test_df)))
targeting_rows = []
for name in estimators.keys():
    estimates = effect_predictions[name]
    selected_idx = estimates.nlargest(top_n).index
    selected = test_df.loc[selected_idx]
    targeting_rows.append(
        {
            "estimator": name,
            "selected_rows": len(selected_idx),
            "selected_share": len(selected_idx) / len(test_df),
            "mean_estimated_cate_selected": estimates.loc[selected_idx].mean(),
            "mean_true_cate_selected": test_df.loc[selected_idx, "true_cate"].mean(),
            "share_true_cate_negative_selected": (test_df.loc[selected_idx, "true_cate"] < 0).mean(),
            "high_need_rate_selected": selected["high_need_segment"].mean(),
            "low_friction_rate_selected": selected["low_friction_segment"].mean(),
        }
    )

oracle_idx = test_df["true_cate"].nlargest(top_n).index
targeting_rows.append(
    {
        "estimator": "Oracle true CATE",
        "selected_rows": len(oracle_idx),
        "selected_share": len(oracle_idx) / len(test_df),
        "mean_estimated_cate_selected": np.nan,
        "mean_true_cate_selected": test_df.loc[oracle_idx, "true_cate"].mean(),
        "share_true_cate_negative_selected": (test_df.loc[oracle_idx, "true_cate"] < 0).mean(),
        "high_need_rate_selected": test_df.loc[oracle_idx, "high_need_segment"].mean(),
        "low_friction_rate_selected": test_df.loc[oracle_idx, "low_friction_segment"].mean(),
    }
)

targeting_summary = pd.DataFrame(targeting_rows).sort_values("mean_true_cate_selected", ascending=False)
targeting_summary.to_csv(TABLE_DIR / "13_targeting_summary.csv", index=False)
display(targeting_summary)
estimator selected_rows selected_share mean_estimated_cate_selected mean_true_cate_selected share_true_cate_negative_selected high_need_rate_selected low_friction_rate_selected
6 Oracle true CATE 280 0.2 NaN 0.954328 0.000000 0.875000 0.471429
1 CausalForestDML 280 0.2 0.672556 0.895554 0.000000 0.807143 0.475000
3 SLearner 280 0.2 1.008308 0.892449 0.000000 0.835714 0.421429
2 DRLearner 280 0.2 0.984641 0.856903 0.000000 0.753571 0.467857
5 XLearner 280 0.2 0.907867 0.850097 0.000000 0.746429 0.421429
4 TLearner 280 0.2 1.153384 0.846643 0.007143 0.764286 0.446429
0 LinearDML 280 0.2 1.154401 0.798281 0.021429 0.685714 0.432143

The targeting table evaluates a decision-oriented question: which estimator finds high-benefit rows? The oracle row shows the upper bound on targeting value, which is available only in synthetic data.

Targeting Plot

This plot compares the true effect among selected rows. It is a high-signal artifact for prioritization use cases.

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=targeting_summary, x="mean_true_cate_selected", y="estimator", color="#16a34a", ax=ax)
ax.axvline(test_df["true_cate"].mean(), color="#111827", linestyle="--", linewidth=1.3, label="Held-out true ATE")
ax.set_title("Mean True CATE Among Top 20% Selected Rows")
ax.set_xlabel("Mean true CATE among selected rows")
ax.set_ylabel("Estimator")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_targeting_summary.png", dpi=160, bbox_inches="tight")
plt.show()

Estimators with better top-group true effects are better for this specific targeting rule. That may or may not match the ranking by CATE RMSE.
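
The 20 percent budget is only one operating point. The sketch below repeats the comparison at a few illustrative budget shares, which shows whether an estimator's targeting advantage holds as the selection rule tightens or loosens.

# Mean true CATE among the top rows selected by each estimator, at several budget shares.
budget_rows = []
for share in [0.05, 0.10, 0.20, 0.30]:
    k = int(np.floor(share * len(test_df)))
    for name in estimators.keys():
        selected_idx = effect_predictions[name].nlargest(k).index
        budget_rows.append(
            {
                "budget_share": share,
                "estimator": name,
                "mean_true_cate_selected": test_df.loc[selected_idx, "true_cate"].mean(),
            }
        )

budget_curve = (
    pd.DataFrame(budget_rows)
    .pivot(index="estimator", columns="budget_share", values="mean_true_cate_selected")
)
display(budget_curve)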

Decile Calibration

Decile calibration checks whether rows with higher estimated CATE also have higher true CATE on average. This is a useful ranking diagnostic.

decile_rows = []
for name in estimators.keys():
    temp = pd.DataFrame({"estimated_cate": effect_predictions[name], "true_cate": true_tau_test})
    temp["decile"] = pd.qcut(temp["estimated_cate"], q=10, labels=False, duplicates="drop") + 1
    decile_summary = (
        temp.groupby("decile", observed=True)
        .agg(
            rows=("true_cate", "size"),
            mean_estimated_cate=("estimated_cate", "mean"),
            mean_true_cate=("true_cate", "mean"),
        )
        .reset_index()
    )
    decile_summary["estimator"] = name
    decile_rows.append(decile_summary)

decile_calibration = pd.concat(decile_rows, ignore_index=True)
decile_calibration.to_csv(TABLE_DIR / "13_decile_calibration.csv", index=False)
display(decile_calibration.head(15))
decile rows mean_estimated_cate mean_true_cate estimator
0 1 140 -0.602850 -0.291655 LinearDML
1 2 140 -0.237903 -0.232706 LinearDML
2 3 140 -0.105576 -0.167671 LinearDML
3 4 140 0.019309 -0.069742 LinearDML
4 5 140 0.163088 0.038262 LinearDML
5 6 140 0.304745 0.252068 LinearDML
6 7 140 0.451947 0.379142 LinearDML
7 8 140 0.605282 0.481980 LinearDML
8 9 140 0.863357 0.647574 LinearDML
9 10 140 1.445445 0.948989 LinearDML
10 1 140 -0.063500 -0.464933 CausalForestDML
11 2 140 0.027663 -0.329350 CausalForestDML
12 3 140 0.094289 -0.179139 CausalForestDML
13 4 140 0.170047 -0.044493 CausalForestDML
14 5 140 0.244050 0.089720 CausalForestDML

A well-ranked estimator should show increasing true CATE as the estimated-CATE decile rises. The next plot makes that pattern easier to compare across estimators.

Decile Calibration Plot

This plot shows the mean true CATE within estimated-CATE deciles for each estimator. Steeper upward curves indicate better ranking separation.

fig, ax = plt.subplots(figsize=(11, 6))
sns.lineplot(data=decile_calibration, x="decile", y="mean_true_cate", hue="estimator", marker="o", linewidth=2, ax=ax)
ax.axhline(test_df["true_cate"].mean(), color="#111827", linestyle="--", linewidth=1, label="Held-out true ATE")
ax.set_title("True CATE By Estimated-CATE Decile")
ax.set_xlabel("Estimated-CATE decile")
ax.set_ylabel("Mean true CATE")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_decile_calibration.png", dpi=160, bbox_inches="tight")
plt.show()

The calibration curves show which estimators create useful separation between low-benefit and high-benefit rows. This can matter more than exact row-level CATE values for targeting workflows.
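
If a single ranking summary is needed alongside the curves, Spearman rank correlation between estimated and true CATE is a reasonable companion. This sketch uses scipy.stats, which is already a dependency of the notebook.

from scipy.stats import spearmanr

# Rank correlation summarizes ranking quality without relying on decile bins.
rank_quality = pd.DataFrame(
    [
        {
            "estimator": name,
            "spearman_rank_correlation": spearmanr(true_tau_test, effect_predictions[name])[0],
        }
        for name in estimators.keys()
    ]
).sort_values("spearman_rank_correlation", ascending=False)
display(rank_quality)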

Segment-Level Benchmark

Segment summaries check whether estimators behave differently across important groups. Here we compare regular and high-need rows, crossed with friction level.

segment_eval = test_df[["baseline_need", "friction_score", "high_need_segment", "low_friction_segment", "true_cate"]].copy()
segment_eval["need_segment"] = np.where(segment_eval["high_need_segment"] == 1, "high need", "regular need")
segment_eval["friction_segment"] = np.where(segment_eval["low_friction_segment"] == 1, "low friction", "regular/high friction")
for name in estimators.keys():
    segment_eval[name] = effect_predictions[name]

segment_rows = []
for name in estimators.keys():
    grouped = (
        segment_eval.groupby(["need_segment", "friction_segment"], observed=True)
        .apply(
            lambda g: pd.Series(
                {
                    "rows": len(g),
                    "true_cate_mean": g["true_cate"].mean(),
                    "estimated_cate_mean": g[name].mean(),
                    "segment_bias": g[name].mean() - g["true_cate"].mean(),
                    "segment_rmse": np.sqrt(mean_squared_error(g["true_cate"], g[name])),
                }
            ),
            include_groups=False,
        )
        .reset_index()
    )
    grouped["estimator"] = name
    segment_rows.append(grouped)

segment_benchmark = pd.concat(segment_rows, ignore_index=True)
segment_benchmark.to_csv(TABLE_DIR / "13_segment_benchmark.csv", index=False)
display(segment_benchmark.head(20))
need_segment friction_segment rows true_cate_mean estimated_cate_mean segment_bias segment_rmse estimator
0 high need low friction 151.0 0.845285 0.824273 -0.021013 0.404828 LinearDML
1 high need regular/high friction 238.0 0.699039 0.798205 0.099166 0.384243 LinearDML
2 regular need low friction 413.0 0.088037 0.160352 0.072316 0.402211 LinearDML
3 regular need regular/high friction 598.0 -0.087450 0.043971 0.131420 0.387448 LinearDML
4 high need low friction 151.0 0.845285 0.614107 -0.231179 0.326789 CausalForestDML
5 high need regular/high friction 238.0 0.699039 0.548843 -0.150197 0.316374 CausalForestDML
6 regular need low friction 413.0 0.088037 0.249279 0.161242 0.270059 CausalForestDML
7 regular need regular/high friction 598.0 -0.087450 0.160646 0.248096 0.346713 CausalForestDML
8 high need low friction 151.0 0.845285 0.787213 -0.058073 0.268847 DRLearner
9 high need regular/high friction 238.0 0.699039 0.608299 -0.090740 0.327093 DRLearner
10 regular need low friction 413.0 0.088037 0.160524 0.072487 0.301348 DRLearner
11 regular need regular/high friction 598.0 -0.087450 -0.019790 0.067660 0.327999 DRLearner
12 high need low friction 151.0 0.845285 0.816750 -0.028535 0.201472 SLearner
13 high need regular/high friction 238.0 0.699039 0.734169 0.035130 0.256452 SLearner
14 regular need low friction 413.0 0.088037 0.149447 0.061410 0.262271 SLearner
15 regular need regular/high friction 598.0 -0.087450 0.120305 0.207755 0.359163 SLearner
16 high need low friction 151.0 0.845285 0.939421 0.094136 0.254808 TLearner
17 high need regular/high friction 238.0 0.699039 0.852736 0.153697 0.318416 TLearner
18 regular need low friction 413.0 0.088037 0.258061 0.170024 0.449238 TLearner
19 regular need regular/high friction 598.0 -0.087450 0.202570 0.290020 0.498050 TLearner

The segment table reveals whether an estimator is consistently accurate or only good in aggregate. Segment failures can be hidden by overall metrics.
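
A quick way to surface such failures is to pull each estimator's worst segment by absolute bias, as in the short sketch below.

# Worst segment per estimator, measured by absolute segment-level bias.
worst_segments = (
    segment_benchmark.assign(abs_segment_bias=segment_benchmark["segment_bias"].abs())
    .sort_values("abs_segment_bias", ascending=False)
    .groupby("estimator")
    .head(1)
    .sort_values("abs_segment_bias", ascending=False)
)
display(worst_segments)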

Segment Bias Heatmap

This heatmap visualizes segment bias by estimator. Values near zero indicate segment-level average effects are close to truth.

segment_benchmark["segment"] = segment_benchmark["need_segment"] + " / " + segment_benchmark["friction_segment"]
segment_bias_heatmap = segment_benchmark.pivot(index="segment", columns="estimator", values="segment_bias")

fig, ax = plt.subplots(figsize=(11, 5))
sns.heatmap(segment_bias_heatmap, annot=True, fmt=".3f", cmap="vlag", center=0, linewidths=0.5, ax=ax)
ax.set_title("Segment-Level CATE Bias By Estimator")
ax.set_xlabel("Estimator")
ax.set_ylabel("Segment")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_segment_bias_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

The heatmap shows where estimators overstate or understate effects. This is often the start of a deeper model diagnostics conversation.

Runtime Versus Accuracy

This plot compares fit runtime with CATE RMSE. It helps students see the practical tradeoff between computation and accuracy.

fig, ax = plt.subplots(figsize=(9, 6))
sns.scatterplot(data=benchmark_metrics, x="fit_seconds", y="cate_rmse", hue="estimator", s=120, ax=ax)
for _, row in benchmark_metrics.iterrows():
    ax.text(row["fit_seconds"] * 1.01, row["cate_rmse"], row["estimator"], fontsize=9, va="center")
ax.set_title("Runtime Versus Held-Out CATE RMSE")
ax.set_xlabel("Fit time in seconds")
ax.set_ylabel("CATE RMSE")
ax.legend_.remove()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_runtime_vs_accuracy.png", dpi=160, bbox_inches="tight")
plt.show()

Fast models are not automatically worse, and slow models are not automatically better. Runtime is one practical constraint among several.

Robustness Across Bootstrap Resamples

A benchmark can be sensitive to one particular test sample. This lightweight bootstrap resamples held-out rows and recomputes the CATE RMSE for each estimator without refitting models.

bootstrap_rows = []
n_bootstrap = 300
for name in estimators.keys():
    estimates = effect_predictions[name].to_numpy()
    bootstrap_rmse = []
    for _ in range(n_bootstrap):
        sample_idx = rng.choice(np.arange(len(true_tau_test)), size=len(true_tau_test), replace=True)
        bootstrap_rmse.append(np.sqrt(mean_squared_error(true_tau_test[sample_idx], estimates[sample_idx])))
    bootstrap_rows.append(
        {
            "estimator": name,
            "rmse_mean": np.mean(bootstrap_rmse),
            "rmse_ci_lower": np.quantile(bootstrap_rmse, 0.025),
            "rmse_ci_upper": np.quantile(bootstrap_rmse, 0.975),
        }
    )

bootstrap_rmse_summary = pd.DataFrame(bootstrap_rows).sort_values("rmse_mean")
bootstrap_rmse_summary.to_csv(TABLE_DIR / "13_bootstrap_rmse_summary.csv", index=False)
display(bootstrap_rmse_summary)
estimator rmse_mean rmse_ci_lower rmse_ci_upper
5 XLearner 0.275009 0.262328 0.286428
3 SLearner 0.301513 0.287719 0.314949
2 DRLearner 0.313540 0.300542 0.325121
1 CausalForestDML 0.318488 0.305658 0.331090
0 LinearDML 0.392984 0.373317 0.414053
4 TLearner 0.436275 0.418239 0.456594

The bootstrap intervals summarize test-sample uncertainty in the RMSE comparison. They do not include refitting uncertainty, but they help avoid overreading tiny leaderboard gaps.

Bootstrap RMSE Plot

The plot shows RMSE means with bootstrap intervals. Overlapping intervals are a cue that estimator differences may not be decisive.

plot_boot = bootstrap_rmse_summary.sort_values("rmse_mean")
fig, ax = plt.subplots(figsize=(10, 5))
ax.errorbar(
    x=plot_boot["rmse_mean"],
    y=plot_boot["estimator"],
    xerr=[plot_boot["rmse_mean"] - plot_boot["rmse_ci_lower"], plot_boot["rmse_ci_upper"] - plot_boot["rmse_mean"]],
    fmt="o",
    color="#2563eb",
    ecolor="#64748b",
    capsize=4,
)
ax.set_title("Bootstrap Test-Sample RMSE Intervals")
ax.set_xlabel("CATE RMSE")
ax.set_ylabel("Estimator")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_bootstrap_rmse_intervals.png", dpi=160, bbox_inches="tight")
plt.show()

The intervals make the benchmark more honest. A small apparent win may not matter if it sits inside the uncertainty band.
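
When two estimators look close, a paired-difference bootstrap on the same resamples is a tighter comparison than two separate intervals. The pair below reflects this run's leaderboard and is only an example; substitute any two estimator names.

# Paired bootstrap: compute both RMSEs on the same resampled rows, then difference them.
name_a, name_b = "XLearner", "SLearner"
est_a = effect_predictions[name_a].to_numpy()
est_b = effect_predictions[name_b].to_numpy()
rmse_differences = []
for _ in range(n_bootstrap):
    sample_idx = rng.choice(np.arange(len(true_tau_test)), size=len(true_tau_test), replace=True)
    rmse_a = np.sqrt(mean_squared_error(true_tau_test[sample_idx], est_a[sample_idx]))
    rmse_b = np.sqrt(mean_squared_error(true_tau_test[sample_idx], est_b[sample_idx]))
    rmse_differences.append(rmse_a - rmse_b)

paired_difference_summary = pd.DataFrame(
    {
        "comparison": [f"{name_a} minus {name_b}"],
        "mean_rmse_difference": [np.mean(rmse_differences)],
        "ci_lower": [np.quantile(rmse_differences, 0.025)],
        "ci_upper": [np.quantile(rmse_differences, 0.975)],
    }
)
display(paired_difference_summary)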

Estimator Selection Guidance

This table translates the benchmark into practical guidance. The right estimator depends on whether the goal is interpretability, nonlinear recovery, targeting, runtime, or robustness.

selection_guidance = pd.DataFrame(
    [
        {
            "use_case": "Need a transparent baseline",
            "estimator_to_try_first": "LinearDML",
            "what_to_check": "ATE bias, coefficient signs, and whether nonlinear estimators improve CATE recovery.",
        },
        {
            "use_case": "Expect nonlinear heterogeneity",
            "estimator_to_try_first": "CausalForestDML or forest-based DRLearner",
            "what_to_check": "CATE RMSE, decile calibration, and segment behavior.",
        },
        {
            "use_case": "Prioritization or targeting",
            "estimator_to_try_first": "The estimator with best top-group true effect in simulation or strongest validation proxy in real data.",
            "what_to_check": "Top-decile value, calibration curves, and stability across resamples.",
        },
        {
            "use_case": "Fast benchmark baseline",
            "estimator_to_try_first": "SLearner or TLearner",
            "what_to_check": "Whether simple outcome-model contrasts are competitive before using heavier methods.",
        },
        {
            "use_case": "Imbalanced treatment assignment",
            "estimator_to_try_first": "XLearner or DRLearner",
            "what_to_check": "Overlap, propensity quality, and selected-group composition.",
        },
        {
            "use_case": "Final reporting",
            "estimator_to_try_first": "Do not rely on one estimator only",
            "what_to_check": "Agreement across families, segment diagnostics, uncertainty, and sensitivity to assumptions.",
        },
    ]
)

selection_guidance.to_csv(TABLE_DIR / "13_estimator_selection_guidance.csv", index=False)
display(selection_guidance)
use_case estimator_to_try_first what_to_check
0 Need a transparent baseline LinearDML ATE bias, coefficient signs, and whether nonli...
1 Expect nonlinear heterogeneity CausalForestDML or forest-based DRLearner CATE RMSE, decile calibration, and segment beh...
2 Prioritization or targeting The estimator with best top-group true effect ... Top-decile value, calibration curves, and stab...
3 Fast benchmark baseline SLearner or TLearner Whether simple outcome-model contrasts are com...
4 Imbalanced treatment assignment XLearner or DRLearner Overlap, propensity quality, and selected-grou...
5 Final reporting Do not rely on one estimator only Agreement across families, segment diagnostics...

The guidance keeps the benchmark from becoming a mechanical leaderboard. Estimator choice should reflect the causal question and the decision workflow.

Final Benchmark Checklist

This checklist summarizes the steps needed for a fair estimator comparison in a real analysis.

benchmark_checklist = pd.DataFrame(
    [
        {
            "step": "Define the estimand",
            "why_it_matters": "Estimator comparison is meaningless if methods are answering different causal questions.",
        },
        {
            "step": "Use one split and one feature set",
            "why_it_matters": "Preprocessing differences can masquerade as estimator differences.",
        },
        {
            "step": "Check overlap and balance first",
            "why_it_matters": "Estimator sophistication cannot fix severe support problems.",
        },
        {
            "step": "Evaluate average and heterogeneous metrics",
            "why_it_matters": "ATE accuracy and CATE ranking are different objectives.",
        },
        {
            "step": "Inspect calibration and segments",
            "why_it_matters": "Overall metrics can hide subgroup failures.",
        },
        {
            "step": "Track runtime and complexity",
            "why_it_matters": "The best estimator must also be maintainable and rerunnable.",
        },
        {
            "step": "Avoid declaring universal winners",
            "why_it_matters": "The best choice depends on data support, assumptions, and the downstream decision.",
        },
    ]
)

benchmark_checklist.to_csv(TABLE_DIR / "13_benchmark_checklist.csv", index=False)
display(benchmark_checklist)
step why_it_matters
0 Define the estimand Estimator comparison is meaningless if methods...
1 Use one split and one feature set Preprocessing differences can masquerade as es...
2 Check overlap and balance first Estimator sophistication cannot fix severe sup...
3 Evaluate average and heterogeneous metrics ATE accuracy and CATE ranking are different ob...
4 Inspect calibration and segments Overall metrics can hide subgroup failures.
5 Track runtime and complexity The best estimator must also be maintainable a...
6 Avoid declaring universal winners The best choice depends on data support, assum...

The checklist is the portable lesson from this notebook. Use benchmarks to learn estimator behavior, not to crown a universal champion.

Summary

This notebook compared six EconML estimators on the same nonlinear binary-treatment ground truth.

The main lessons are:

  • Estimator comparisons should use the same split, same features, and same metrics.
  • ATE bias, CATE RMSE, CATE correlation, targeting value, and runtime answer different questions.
  • Orthogonal estimators and meta-learners can excel under different conditions.
  • Decile calibration is useful when treatment-effect estimates will be used for ranking.
  • Segment diagnostics can reveal failures that aggregate metrics hide.
  • Bootstrap test-sample intervals help prevent overreading tiny leaderboard gaps.
  • The right estimator is the one that best supports the causal question and downstream decision.

The next tutorial combines many of these ingredients into an end-to-end case study.