A causal inference library gives us many estimators, but the important practical question is not "which estimator sounds most advanced?" It is: which estimator best matches the treatment assignment process, outcome structure, sample size, overlap, and decision goal?
This notebook builds a controlled benchmark where every estimator sees the same data, the same train-test split, and the same evaluation metrics. Because the data are synthetic, we know the true CATE for every held-out row. That lets us compare estimators in a way that real observational data cannot.
We will compare six EconML estimators:
LinearDML
CausalForestDML
DRLearner
SLearner
TLearner
XLearner
The benchmark evaluates more than CATE RMSE. We also examine ATE bias, CATE ranking quality, top-group targeting value, decile calibration, segment behavior, and runtime. The goal is to build a habit of comparing estimators across the decision surface, not just one leaderboard metric.
Learning Goals
By the end of this notebook, you should be able to:
design a fair benchmark for heterogeneous treatment-effect estimators;
keep the data split, features, and metrics identical across estimators;
compare orthogonal estimators with meta-learners;
evaluate CATE recovery, ATE bias, ranking quality, and policy-style targeting value;
diagnose estimator behavior by CATE decile and by business-relevant segment;
explain why the “best” estimator depends on the use case, not only a single score.
Tutorial Flow
The notebook has five parts:
Create a nonlinear binary-treatment teaching dataset with confounded assignment and known CATE.
Run basic diagnostics: treatment balance, propensity overlap, and true effect distribution.
Fit six EconML estimators on the same training data.
Compare held-out performance across recovery, ranking, calibration, targeting, and runtime.
Summarize how to choose an estimator based on the observed benchmark pattern.
Every result-producing code cell is followed by a short discussion so the benchmark reads like an analysis, not just a list of tables.
Setup
This cell imports the notebook dependencies, creates output folders, sets plotting defaults, and suppresses harmless display warnings. All code remains visible so students can rerun and modify the benchmark.
from pathlib import Path
import os
import time
import warnings

# Suppress optional widget warnings that can appear while importing EconML in headless notebook runs.
warnings.filterwarnings("ignore", message="IProgress not found.*")

# Keep Matplotlib cache files in a writable location during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")

import econml
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

from econml.dml import CausalForestDML, LinearDML
from econml.dr import DRLearner
from econml.metalearners import SLearner, TLearner, XLearner

warnings.filterwarnings("ignore", message="X does not have valid feature names.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Not all column names are strings.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Co-variance matrix is underdetermined.*", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.titleweight"] = "bold"
plt.rcParams["axes.labelsize"] = 11


def find_project_root(start=None):
    """Find the repository root from either the repo or a nested notebook folder."""
    start = Path.cwd() if start is None else Path(start)
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").exists() and (candidate / "notebooks").exists():
            return candidate
    return Path.cwd()


PROJECT_ROOT = find_project_root()
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "econml"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(202613)

print(f"Project root: {PROJECT_ROOT}")
print(f"EconML version: {econml.__version__}")
print(f"Figures will be saved to: {FIGURE_DIR.relative_to(PROJECT_ROOT)}")
print(f"Tables will be saved to: {TABLE_DIR.relative_to(PROJECT_ROOT)}")
Project root: /home/apex/Documents/ranking_sys
EconML version: 0.16.0
Figures will be saved to: notebooks/tutorials/econml/outputs/figures
Tables will be saved to: notebooks/tutorials/econml/outputs/tables
The environment is ready. All artifacts from this benchmark use the 13_ prefix.
Benchmark Design Principles
Before creating data, we define what “fair comparison” means. The estimators differ internally, but the benchmark should hold the external setup fixed.
benchmark_principles = pd.DataFrame(
    [
        {
            "principle": "Same data split",
            "how_this_notebook_applies_it": "Every estimator trains on the same training rows and is evaluated on the same held-out rows.",
            "why_it_matters": "Different splits can make estimator differences look larger or smaller than they are.",
        },
        {
            "principle": "Same feature set",
            "how_this_notebook_applies_it": "All estimators receive the same observed pre-treatment covariates.",
            "why_it_matters": "Feature differences should not be confused with estimator differences.",
        },
        {
            "principle": "Multiple metrics",
            "how_this_notebook_applies_it": "We compare ATE bias, CATE RMSE, correlation, targeting value, calibration, and runtime.",
            "why_it_matters": "One metric rarely captures all decision needs.",
        },
        {
            "principle": "Ground-truth teaching data",
            "how_this_notebook_applies_it": "The synthetic data include true CATE for held-out evaluation.",
            "why_it_matters": "Real data cannot directly validate row-level CATE recovery.",
        },
        {
            "principle": "Decision-aware ranking",
            "how_this_notebook_applies_it": "We evaluate top-group true effects, not only pointwise errors.",
            "why_it_matters": "Many CATE workflows are used for prioritization rather than exact prediction.",
        },
    ]
)
benchmark_principles.to_csv(TABLE_DIR / "13_benchmark_principles.csv", index=False)
display(benchmark_principles)
| principle | how_this_notebook_applies_it | why_it_matters |
|---|---|---|
| Same data split | Every estimator trains on the same training rows and is evaluated on the same held-out rows. | Different splits can make estimator differences look larger or smaller than they are. |
| Same feature set | All estimators receive the same observed pre-treatment covariates. | Feature differences should not be confused with estimator differences. |
| Multiple metrics | We compare ATE bias, CATE RMSE, correlation, targeting value, calibration, and runtime. | One metric rarely captures all decision needs. |
| Ground-truth teaching data | The synthetic data include true CATE for held-out evaluation. | Real data cannot directly validate row-level CATE recovery. |
| Decision-aware ranking | We evaluate top-group true effects, not only pointwise errors. | Many CATE workflows are used for prioritization rather than exact prediction. |
These principles keep the benchmark honest. We want estimator differences to come from estimator behavior, not from accidental differences in preprocessing or evaluation.
Estimator Map
This table summarizes the estimator families included in the benchmark. The point is not to declare a universal winner; it is to make each estimator’s modeling strategy explicit.
estimator_map = pd.DataFrame(
    [
        {
            "estimator": "LinearDML",
            "family": "orthogonal DML",
            "core_idea": "Residualize outcome and treatment, then fit a structured final CATE model.",
            "expected_strength": "Good when effect heterogeneity is fairly smooth or well represented by chosen features.",
        },
        {
            "estimator": "CausalForestDML",
            "family": "orthogonal forest",
            "core_idea": "Use forest splits to learn nonlinear heterogeneous effects with orthogonalization.",
            "expected_strength": "Good for nonlinear heterogeneity and segment discovery.",
        },
        {
            "estimator": "DRLearner",
            "family": "doubly robust",
            "core_idea": "Combine outcome and propensity nuisance models into a pseudo-outcome for CATE learning.",
            "expected_strength": "Good when either propensity or outcome modeling is strong enough to stabilize estimates.",
        },
        {
            "estimator": "SLearner",
            "family": "meta-learner",
            "core_idea": "Fit one outcome model using treatment as a feature, then contrast predicted outcomes.",
            "expected_strength": "Simple baseline that can work when treatment interactions are learned well.",
        },
        {
            "estimator": "TLearner",
            "family": "meta-learner",
            "core_idea": "Fit separate outcome models for treated and control rows, then subtract predictions.",
            "expected_strength": "Useful when treated and control outcome surfaces differ strongly and both groups have support.",
        },
        {
            "estimator": "XLearner",
            "family": "meta-learner",
            "core_idea": "Impute treatment effects in each arm and combine them using propensity information.",
            "expected_strength": "Often useful when treatment groups are imbalanced or effects are heterogeneous.",
        },
    ]
)
estimator_map.to_csv(TABLE_DIR / "13_estimator_map.csv", index=False)
display(estimator_map)
| estimator | family | core_idea | expected_strength |
|---|---|---|---|
| LinearDML | orthogonal DML | Residualize outcome and treatment, then fit a structured final CATE model. | Good when effect heterogeneity is fairly smooth or well represented by chosen features. |
| CausalForestDML | orthogonal forest | Use forest splits to learn nonlinear heterogeneous effects with orthogonalization. | Good for nonlinear heterogeneity and segment discovery. |
| DRLearner | doubly robust | Combine outcome and propensity nuisance models into a pseudo-outcome for CATE learning. | Good when either propensity or outcome modeling is strong enough to stabilize estimates. |
| SLearner | meta-learner | Fit one outcome model using treatment as a feature, then contrast predicted outcomes. | Simple baseline that can work when treatment interactions are learned well. |
| TLearner | meta-learner | Fit separate outcome models for treated and control rows, then subtract predictions. | Useful when treated and control outcome surfaces differ strongly and both groups have support. |
| XLearner | meta-learner | Impute treatment effects in each arm and combine them using propensity information. | Often useful when treatment groups are imbalanced or effects are heterogeneous. |
The estimators differ in how they use nuisance models, treatment assignment, and outcome surfaces. That is why the same dataset can favor different methods under different metrics.
Teaching Data Design
The synthetic data include nonlinear treatment effects, confounded binary treatment assignment, and realistic overlap. Treatment assignment depends on observed covariates, so naive treated-control comparisons will be biased.
The first rows show the complete teaching dataset. The models will use observed covariates, treatment, and outcome; the true CATE and true propensity are kept only for diagnostics.
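The generation cell is not reproduced here, so the following is a minimal sketch of the kind of process described. The coefficients, functional forms, sample size, and the derived-segment thresholds are illustrative assumptions, not the notebook's exact values; only a subset of the covariates from the field dictionary below is generated.

# Illustrative data-generating process (all coefficients and forms are assumptions).
n = 7000
df = pd.DataFrame({
    "baseline_need": rng.normal(0, 1, n),
    "prior_engagement": rng.normal(0, 1, n),
    "friction_score": rng.normal(0, 1, n),
    "content_affinity": rng.normal(0, 1, n),
})
df["high_need_segment"] = (df["baseline_need"] > 0.8).astype(int)
df["low_friction_segment"] = (df["friction_score"] < -0.3).astype(int)

# Confounded assignment: treatment probability depends on observed covariates.
df["true_propensity"] = expit(0.9 * df["baseline_need"] - 0.7 * df["friction_score"])
df["treatment"] = rng.binomial(1, df["true_propensity"])

# Nonlinear heterogeneous effect with low-benefit and high-benefit regions.
df["true_cate"] = (
    0.2
    + 0.6 * np.tanh(df["baseline_need"])
    - 0.4 * expit(2.0 * df["friction_score"])
    + 0.2 * df["content_affinity"]
)
df["baseline_outcome"] = df["prior_engagement"] + 0.8 * df["baseline_need"]
df["outcome"] = df["baseline_outcome"] + df["treatment"] * df["true_cate"] + rng.normal(0, 0.5, n)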
Field Dictionary
This table describes the fields used in the benchmark. Clear field roles help students separate observed inputs from teaching-only truth.
field_dictionary = pd.DataFrame(
    [
        ("baseline_need", "Observed covariate", "Pre-treatment demand or need signal."),
        ("prior_engagement", "Observed covariate", "Pre-treatment engagement signal."),
        ("friction_score", "Observed covariate", "Higher values mean more friction."),
        ("content_affinity", "Observed covariate", "Match between row and content or offer."),
        ("novelty_preference", "Observed covariate", "Preference for exploratory or new experiences."),
        ("price_sensitivity", "Observed covariate", "Sensitivity to cost, effort, or inconvenience."),
        ("account_tenure", "Observed covariate", "Age of the account or relationship in weeks."),
        ("region_risk", "Observed covariate", "Binary marker for lower baseline outcome regions."),
        ("device_stability", "Observed covariate", "Technical stability or reliability signal."),
        ("high_need_segment", "Observed covariate", "Binary segment derived from baseline need."),
        ("low_friction_segment", "Observed covariate", "Binary segment derived from friction score."),
        ("treatment", "Treatment", "Binary intervention assignment."),
        ("outcome", "Outcome", "Observed post-treatment outcome."),
        ("true_propensity", "Teaching-only truth", "Known treatment probability from the synthetic assignment process."),
        ("true_cate", "Teaching-only truth", "Known conditional treatment effect for each row."),
        ("baseline_outcome", "Teaching-only truth", "Potential baseline outcome component before treatment effect and noise."),
    ],
    columns=["field", "role", "description"],
)
field_dictionary.to_csv(TABLE_DIR / "13_field_dictionary.csv", index=False)
display(field_dictionary)
| field | role | description |
|---|---|---|
| baseline_need | Observed covariate | Pre-treatment demand or need signal. |
| prior_engagement | Observed covariate | Pre-treatment engagement signal. |
| friction_score | Observed covariate | Higher values mean more friction. |
| content_affinity | Observed covariate | Match between row and content or offer. |
| novelty_preference | Observed covariate | Preference for exploratory or new experiences. |
| price_sensitivity | Observed covariate | Sensitivity to cost, effort, or inconvenience. |
| account_tenure | Observed covariate | Age of the account or relationship in weeks. |
| region_risk | Observed covariate | Binary marker for lower baseline outcome regions. |
| device_stability | Observed covariate | Technical stability or reliability signal. |
| high_need_segment | Observed covariate | Binary segment derived from baseline need. |
| low_friction_segment | Observed covariate | Binary segment derived from friction score. |
| treatment | Treatment | Binary intervention assignment. |
| outcome | Outcome | Observed post-treatment outcome. |
| true_propensity | Teaching-only truth | Known treatment probability from the synthetic assignment process. |
| true_cate | Teaching-only truth | Known conditional treatment effect for each row. |
| baseline_outcome | Teaching-only truth | Potential baseline outcome component before treatment effect and noise. |
The teaching-only fields make benchmarking possible. Real applications usually cannot compute CATE RMSE because the true row-level treatment effect is unknown.
Basic Data Summary
This cell summarizes the treatment rate, outcome distribution, true effect distribution, and overlap range. These checks establish the difficulty of the benchmark before fitting any estimator.
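A minimal sketch of such a summary cell, assuming the teaching DataFrame `df` generated above:

# Headline checks before any estimator is fit.
summary = {
    "treatment_rate": df["treatment"].mean(),
    "outcome_mean": df["outcome"].mean(),
    "true_ate": df["true_cate"].mean(),
    "true_cate_std": df["true_cate"].std(),
    "propensity_min": df["true_propensity"].min(),
    "propensity_max": df["true_propensity"].max(),
}
display(pd.Series(summary).round(3))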
The true CATE has meaningful spread, so estimator ranking and heterogeneity recovery matter. This is not just an average-effect problem.
True CATE Distribution
The true CATE distribution gives a visual sense of the benchmark target. Estimators should recover both the level and ranking of this distribution on held-out data.
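A sketch of the histogram cell, assuming the `df` from the generation sketch:

# Distribution of the true CATE: the benchmark target on held-out rows.
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df["true_cate"], bins=40, color="#2563eb", alpha=0.8)
ax.axvline(df["true_cate"].mean(), color="#111827", linestyle="--", label="True ATE")
ax.set_title("True CATE Distribution")
ax.set_xlabel("True CATE")
ax.set_ylabel("Count")
ax.legend()
plt.tight_layout()
plt.show()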
The distribution contains low-benefit and high-benefit regions. A good estimator should not only estimate the mean, but also identify the right tail well enough for targeting.
Naive Treated-Control Difference
Before fitting causal estimators, we compute the raw treated-control difference. This is not a causal estimate because treatment assignment is confounded.
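A sketch of the naive comparison, assuming the teaching `df` from above:

# Raw treated-minus-control difference; biased here because assignment is confounded.
naive_diff = (
    df.loc[df["treatment"] == 1, "outcome"].mean()
    - df.loc[df["treatment"] == 0, "outcome"].mean()
)
print(f"Naive treated-control difference: {naive_diff:.3f}")
print(f"True ATE:                         {df['true_cate'].mean():.3f}")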
The raw difference mixes true treatment effect with selection into treatment. The benchmark therefore needs estimators that adjust for observed confounding.
Covariate Balance
This cell computes standardized mean differences between treated and control rows. Large imbalance tells us where adjustment is needed.
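A sketch of the balance computation, assuming `df` from above; the covariate list here is a subset and should be extended to the full field dictionary:

# Standardized mean difference per covariate (pooled-variance denominator).
covariates = ["baseline_need", "prior_engagement", "friction_score", "content_affinity"]
treated, control = df[df["treatment"] == 1], df[df["treatment"] == 0]
smd = {
    col: (treated[col].mean() - control[col].mean())
    / np.sqrt(0.5 * (treated[col].var() + control[col].var()))
    for col in covariates
}
balance = pd.Series(smd, name="smd").sort_values()
display(balance.round(3))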
The imbalance pattern confirms that treatment assignment is observational. This creates a meaningful benchmark for methods that model propensity, outcomes, or both.
Balance Plot
The plot makes the magnitude and direction of imbalance easier to scan than the table alone.
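A sketch of the balance plot, assuming the `balance` Series computed above:

fig, ax = plt.subplots(figsize=(9, 5))
balance.plot.barh(ax=ax, color="#2563eb")
ax.axvline(0, color="#111827", linewidth=1)
ax.set_title("Standardized Mean Differences (Treated - Control)")
ax.set_xlabel("SMD")
plt.tight_layout()
plt.show()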
The treated and control distributions overlap, but treatment is clearly not random. That is the sweet spot for a useful teaching benchmark: confounding exists, but overlap is not catastrophically bad.
Train-Test Split
We split the data once and reuse the split for every estimator. Stratifying on treatment keeps treatment rates similar across training and test data.
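A sketch of the shared split, assuming `df` from above; the test fraction and random seed are assumptions:

# One shared split, stratified on treatment so both partitions keep a similar treatment rate.
teaching_only = {"treatment", "outcome", "true_propensity", "true_cate", "baseline_outcome"}
covariate_cols = [c for c in df.columns if c not in teaching_only]
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["treatment"], random_state=42)
X_train, X_test = train_df[covariate_cols].to_numpy(), test_df[covariate_cols].to_numpy()
T_train, Y_train = train_df["treatment"].to_numpy(), train_df["outcome"].to_numpy()
true_tau_test = test_df["true_cate"].to_numpy()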
The shared matrices are the foundation of a fair comparison. From here on, estimator differences are not due to different features or splits.
Nuisance Diagnostics
Before comparing CATE estimators, we check whether basic models can predict treatment and outcome. Strong nuisance signal is expected because treatment is confounded and the outcome is structured.
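A sketch of the nuisance checks, assuming the split arrays above; the model choices and cross-validation settings are assumptions:

from sklearn.model_selection import cross_val_score

# Cross-validated signal in the two nuisance problems: propensity and outcome.
prop_model = RandomForestClassifier(n_estimators=200, min_samples_leaf=20, random_state=0)
prop_auc = cross_val_score(prop_model, X_train, T_train, cv=3, scoring="roc_auc").mean()

out_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
out_r2 = cross_val_score(out_model, X_train, Y_train, cv=3, scoring="r2").mean()

print(f"Propensity AUC (cross-validated): {prop_auc:.3f}")
print(f"Outcome R^2 (cross-validated):    {out_r2:.3f}")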
The propensity model can predict treatment assignment, so confounding is present. The outcome model also has signal, which should help estimators that rely on outcome nuisance models.
Define Estimators
This cell creates all estimators in the benchmark. The hyperparameters are intentionally moderate so the notebook runs quickly while still showing meaningful differences.
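A plausible configuration is sketched below; the hyperparameters are assumptions chosen for speed, not the notebook's exact values. All nuisance models share the same moderate forest settings so estimator differences come from the estimator, not the nuisance learner.

def make_regressor():
    return RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)

def make_classifier():
    return RandomForestClassifier(n_estimators=200, min_samples_leaf=20, random_state=0)

estimators = {
    "LinearDML": LinearDML(model_y=make_regressor(), model_t=make_classifier(),
                           discrete_treatment=True, cv=2, random_state=0),
    "CausalForestDML": CausalForestDML(model_y=make_regressor(), model_t=make_classifier(),
                                       discrete_treatment=True, n_estimators=400,
                                       cv=2, random_state=0),
    "DRLearner": DRLearner(model_propensity=make_classifier(), model_regression=make_regressor(),
                           model_final=make_regressor(), cv=2, random_state=0),
    "SLearner": SLearner(overall_model=make_regressor()),
    "TLearner": TLearner(models=make_regressor()),
    "XLearner": XLearner(models=make_regressor(), cate_models=make_regressor(),
                         propensity_model=make_classifier()),
}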
All estimators are now configured. Some use orthogonalization, some use doubly robust pseudo-outcomes, and some use meta-learner outcome-model contrasts.
Fit All Estimators
This cell fits every estimator and records runtime. The benchmark keeps fitting code centralized so every estimator is evaluated consistently.
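A minimal sketch of the centralized fitting loop, assuming the `estimators` dictionary and training arrays defined above; all six estimators accept the same `fit(Y, T, X=...)` call in EconML:

# Fit every estimator on the identical training arrays and record wall-clock fit time.
fit_records = []
for name, est in estimators.items():
    start = time.perf_counter()
    est.fit(Y_train, T_train, X=X_train)
    fit_records.append({"estimator": name, "fit_seconds": time.perf_counter() - start})
fit_runtime = pd.DataFrame(fit_records)
display(fit_runtime.round(3))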
Runtime is not the main scientific metric, but it matters in practice. A slower estimator may be worth it if it improves the decision metric that matters.
Predict Held-Out CATEs
After fitting, each estimator predicts treatment effects on the same held-out rows. We store all predictions in one table so downstream diagnostics are easy to compare.
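A sketch of the shared prediction table, assuming the fitted `estimators` above; `effect(X)` is EconML's per-row CATE prediction:

# One held-out prediction table shared by every downstream diagnostic.
predict_records, columns = [], {}
for name, est in estimators.items():
    start = time.perf_counter()
    columns[name] = np.ravel(est.effect(X_test))
    predict_records.append({"estimator": name, "predict_seconds": time.perf_counter() - start})
effect_predictions = pd.DataFrame(columns)
prediction_runtime = pd.DataFrame(predict_records)
display(effect_predictions.head().round(3))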
The prediction table is the central benchmark artifact. Every later comparison is computed from this shared held-out table.
Main Benchmark Metrics
This cell computes a shared metric set for every estimator: average-effect bias, CATE error, CATE correlation, positive-effect rate, and top-20-percent targeting value.
def summarize_estimator(name, estimate):
    """Compute common held-out benchmark metrics for one estimator."""
    top_n = int(np.floor(0.20 * len(estimate)))
    top_idx = pd.Series(estimate).nlargest(top_n).index
    return {
        "estimator": name,
        "true_ate": true_tau_test.mean(),
        "estimated_ate": estimate.mean(),
        "ate_bias": estimate.mean() - true_tau_test.mean(),
        "absolute_ate_bias": abs(estimate.mean() - true_tau_test.mean()),
        "cate_rmse": np.sqrt(mean_squared_error(true_tau_test, estimate)),
        "cate_mae": mean_absolute_error(true_tau_test, estimate),
        "cate_correlation": np.corrcoef(true_tau_test, estimate)[0, 1],
        "share_estimated_positive": np.mean(estimate > 0),
        "top_20_true_cate_mean": true_tau_test[top_idx].mean(),
        "top_20_estimated_cate_mean": estimate[top_idx].mean(),
        "top_20_true_negative_share": np.mean(true_tau_test[top_idx] < 0),
    }


benchmark_metrics = (
    pd.DataFrame(
        [summarize_estimator(name, effect_predictions[name].to_numpy()) for name in estimators.keys()]
    )
    .merge(fit_runtime, on="estimator", how="left")
    .merge(prediction_runtime, on="estimator", how="left")
)
benchmark_metrics = benchmark_metrics.sort_values("cate_rmse").reset_index(drop=True)
benchmark_metrics.to_csv(TABLE_DIR / "13_benchmark_metrics.csv", index=False)
display(benchmark_metrics)
| estimator | true_ate | estimated_ate | ate_bias | absolute_ate_bias | cate_rmse | cate_mae | cate_correlation | share_estimated_positive | top_20_true_cate_mean | top_20_estimated_cate_mean | top_20_true_negative_share | fit_seconds | predict_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLearner | 0.198624 | 0.273597 | 0.074973 | 0.074973 | 0.274643 | 0.211071 | 0.855272 | 0.694286 | 0.850097 | 0.907867 | 0.000000 | 1.815952 | 0.191745 |
| SLearner | 0.198624 | 0.308376 | 0.109752 | 0.109752 | 0.301581 | 0.229460 | 0.836177 | 0.793571 | 0.892449 | 1.008308 | 0.000000 | 0.383530 | 0.103912 |
| DRLearner | 0.198624 | 0.227219 | 0.028595 | 0.028595 | 0.314208 | 0.244124 | 0.813792 | 0.662857 | 0.856903 | 0.984641 | 0.000000 | 2.173814 | 0.097243 |
| CausalForestDML | 0.198624 | 0.301695 | 0.103071 | 0.103071 | 0.318427 | 0.245153 | 0.913833 | 0.893571 | 0.895554 | 0.672556 | 0.000000 | 3.261215 | 0.076335 |
| LinearDML | 0.198624 | 0.290684 | 0.092060 | 0.092060 | 0.393217 | 0.288085 | 0.761342 | 0.665714 | 0.798281 | 1.154401 | 0.021429 | 3.017651 | 0.002862 |
| TLearner | 0.198624 | 0.408943 | 0.210319 | 0.210319 | 0.435578 | 0.331652 | 0.719571 | 0.773571 | 0.846643 | 1.153384 | 0.007143 | 0.729670 | 0.169307 |
The table is the main leaderboard, but it should not be read as a single universal ranking. CATE RMSE, ATE bias, and targeting value can point to different choices.
Metric Ranking Table
A compact rank table helps compare estimators across several objectives. Lower rank is better for each metric.
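A sketch of such a rank table, assuming the `benchmark_metrics` table above; the set of ranked metrics is an assumption:

# Rank estimators on several objectives at once (rank 1 = best for that metric).
rank_table = pd.DataFrame({"estimator": benchmark_metrics["estimator"]})
rank_table["rmse_rank"] = benchmark_metrics["cate_rmse"].rank()
rank_table["ate_bias_rank"] = benchmark_metrics["absolute_ate_bias"].rank()
rank_table["correlation_rank"] = benchmark_metrics["cate_correlation"].rank(ascending=False)
rank_table["targeting_rank"] = benchmark_metrics["top_20_true_cate_mean"].rank(ascending=False)
rank_table["runtime_rank"] = benchmark_metrics["fit_seconds"].rank()
rank_table["average_rank"] = rank_table.drop(columns="estimator").mean(axis=1)
display(rank_table.sort_values("average_rank").round(2))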
The average rank is a simple summary, not a decision rule. A real project should weight metrics based on the use case: estimation accuracy, targeting, or runtime.
CATE RMSE And ATE Bias Plot
This plot shows two different ways an estimator can be good or bad: pointwise CATE recovery and average-effect bias.
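A sketch of the side-by-side view, assuming `benchmark_metrics` from above:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.barplot(data=benchmark_metrics, x="cate_rmse", y="estimator", color="#2563eb", ax=axes[0])
axes[0].set_title("Held-Out CATE RMSE")
sns.barplot(data=benchmark_metrics, x="ate_bias", y="estimator", color="#dc2626", ax=axes[1])
axes[1].axvline(0, color="#111827", linewidth=1)
axes[1].set_title("ATE Bias")
plt.tight_layout()
plt.show()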
The side-by-side view is useful because average-effect accuracy and row-level recovery are different goals. An estimator can be close on the ATE while still ranking rows poorly.
CATE Recovery Scatterplots
Scatterplots compare true and estimated CATE values directly. The diagonal line marks perfect recovery.
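A sketch of the scatter grid, assuming `true_tau_test` and `effect_predictions` from above:

# One panel per estimator; the diagonal marks perfect CATE recovery.
fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharex=True, sharey=True)
for ax, name in zip(axes.ravel(), estimators):
    ax.scatter(true_tau_test, effect_predictions[name], s=6, alpha=0.3)
    lims = [true_tau_test.min(), true_tau_test.max()]
    ax.plot(lims, lims, color="#111827", linewidth=1)
    ax.set_title(name)
fig.supxlabel("True CATE")
fig.supylabel("Estimated CATE")
plt.tight_layout()
plt.show()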
The scatterplots show shape differences that a table can hide. Some estimators may compress the range, while others may capture ranking but have level bias.
Ranking And Targeting Comparison
Many applied CATE workflows use estimates for ranking rather than point prediction. This cell compares the mean true CATE among the rows each estimator places in its top 20 percent.
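A sketch of the targeting table, assuming the shared held-out objects above; the oracle row selects on the true CATE itself:

# True effect among each estimator's top-20% rows, plus an oracle selection ceiling.
top_n = int(np.floor(0.20 * len(test_df)))
rows = []
for name in estimators:
    top_idx = effect_predictions[name].nlargest(top_n).index
    rows.append({"estimator": name, "mean_true_cate_selected": true_tau_test[top_idx].mean()})
oracle_idx = pd.Series(true_tau_test).nlargest(top_n).index
rows.append({"estimator": "Oracle (true CATE)", "mean_true_cate_selected": true_tau_test[oracle_idx].mean()})
targeting_summary = pd.DataFrame(rows).sort_values("mean_true_cate_selected")
display(targeting_summary.round(3))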
The targeting table evaluates a decision-oriented question: which estimator finds high-benefit rows? The oracle row shows the upper benchmark available only in synthetic data.
Targeting Plot
This plot compares the true effect among selected rows. It is a high-signal artifact for prioritization use cases.
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=targeting_summary, x="mean_true_cate_selected", y="estimator", color="#16a34a", ax=ax)
ax.axvline(test_df["true_cate"].mean(), color="#111827", linestyle="--", linewidth=1.3, label="Held-out true ATE")
ax.set_title("Mean True CATE Among Top 20% Selected Rows")
ax.set_xlabel("Mean true CATE among selected rows")
ax.set_ylabel("Estimator")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_targeting_summary.png", dpi=160, bbox_inches="tight")
plt.show()
Estimators with better top-group true effects are better for this specific targeting rule. That may or may not match the ranking by CATE RMSE.
Decile Calibration
Decile calibration checks whether rows with higher estimated CATE also have higher true CATE on average. This is a useful ranking diagnostic.
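A sketch of the decile computation, assuming `effect_predictions` and `true_tau_test` from above:

# Mean true CATE within deciles of each estimator's predicted CATE.
decile_rows = []
for name in estimators:
    deciles = pd.qcut(effect_predictions[name], 10, labels=False, duplicates="drop")
    means = pd.Series(true_tau_test).groupby(deciles).mean()
    for decile, value in means.items():
        decile_rows.append({"estimator": name, "decile": decile + 1, "mean_true_cate": value})
decile_calibration = pd.DataFrame(decile_rows)
display(decile_calibration.pivot(index="decile", columns="estimator", values="mean_true_cate").round(3))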
A well-ranked estimator should show increasing true CATE as the estimated-CATE decile rises. The next plot makes that pattern easier to compare across estimators.
Decile Calibration Plot
This plot shows the mean true CATE within estimated-CATE deciles for each estimator. Steeper upward curves indicate better ranking separation.
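A sketch of the calibration plot, assuming the `decile_calibration` table above:

fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=decile_calibration, x="decile", y="mean_true_cate", hue="estimator", marker="o", ax=ax)
ax.set_title("Mean True CATE by Estimated-CATE Decile")
ax.set_xlabel("Estimated-CATE decile (1 = lowest)")
ax.set_ylabel("Mean true CATE")
plt.tight_layout()
plt.show()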
The calibration curves show which estimators create useful separation between low-benefit and high-benefit rows. This can matter more than exact row-level CATE values for targeting workflows.
Segment-Level Benchmark
Segment summaries check whether estimators behave differently across important groups. Here we compare regular and high-need rows, crossed with friction level.
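A sketch of the segment comparison, assuming the segment flags from the field dictionary and the held-out objects above:

# Mean estimation error (estimated minus true) by need-by-friction segment.
segment_df = test_df.reset_index(drop=True).copy()
seg_rows = []
for name in estimators:
    segment_df["error"] = effect_predictions[name] - true_tau_test
    grouped = segment_df.groupby(["high_need_segment", "low_friction_segment"])["error"].mean()
    for (need, friction), value in grouped.items():
        seg_rows.append({"estimator": name, "high_need_segment": need,
                         "low_friction_segment": friction, "mean_error": value})
segment_summary = pd.DataFrame(seg_rows)
pivot = segment_summary.pivot_table(index="estimator",
                                    columns=["high_need_segment", "low_friction_segment"],
                                    values="mean_error")
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="RdBu_r", center=0, ax=ax)
ax.set_title("Mean CATE Error by Segment")
plt.tight_layout()
plt.show()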
The heatmap shows where estimators overstate or understate effects. This is often the start of a deeper model diagnostics conversation.
Runtime Versus Accuracy
This plot compares fit runtime with CATE RMSE. It helps students see the practical tradeoff between computation and accuracy.
fig, ax = plt.subplots(figsize=(9, 6))
sns.scatterplot(data=benchmark_metrics, x="fit_seconds", y="cate_rmse", hue="estimator", s=120, ax=ax)
for _, row in benchmark_metrics.iterrows():
    ax.text(row["fit_seconds"] * 1.01, row["cate_rmse"], row["estimator"], fontsize=9, va="center")
ax.set_title("Runtime Versus Held-Out CATE RMSE")
ax.set_xlabel("Fit time in seconds")
ax.set_ylabel("CATE RMSE")
ax.legend_.remove()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "13_runtime_vs_accuracy.png", dpi=160, bbox_inches="tight")
plt.show()
Fast models are not automatically worse, and slow models are not automatically better. Runtime is one practical constraint among several.
Robustness Across Bootstrap Resamples
A benchmark can be sensitive to one particular test sample. This lightweight bootstrap resamples held-out rows and recomputes the CATE RMSE for each estimator without refitting models.
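A sketch of the resampling step, assuming the shared held-out objects above; the number of resamples is an assumption:

# Resample held-out rows (no refitting) and recompute CATE RMSE per estimator.
n_boot, n_test = 500, len(true_tau_test)
boot_rows = []
for _ in range(n_boot):
    idx = rng.integers(0, n_test, n_test)
    for name in estimators:
        rmse = np.sqrt(mean_squared_error(true_tau_test[idx],
                                          effect_predictions[name].to_numpy()[idx]))
        boot_rows.append({"estimator": name, "rmse": rmse})
boot_summary = pd.DataFrame(boot_rows).groupby("estimator")["rmse"].agg(
    mean="mean",
    lo=lambda s: s.quantile(0.025),
    hi=lambda s: s.quantile(0.975),
)
display(boot_summary.round(3))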
The bootstrap intervals summarize test-sample uncertainty in the RMSE comparison. They do not include refitting uncertainty, but they help avoid overreading tiny leaderboard gaps.
Bootstrap RMSE Plot
The plot shows RMSE means with bootstrap intervals. Overlapping intervals are a cue that estimator differences may not be decisive.
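A sketch of the interval plot, assuming the `boot_summary` table above:

fig, ax = plt.subplots(figsize=(9, 6))
ordered = boot_summary.sort_values("mean")
ax.errorbar(ordered["mean"], np.arange(len(ordered)),
            xerr=[ordered["mean"] - ordered["lo"], ordered["hi"] - ordered["mean"]],
            fmt="o", color="#2563eb", capsize=4)
ax.set_yticks(np.arange(len(ordered)))
ax.set_yticklabels(ordered.index)
ax.set_title("Bootstrap CATE RMSE (95% interval)")
ax.set_xlabel("CATE RMSE")
plt.tight_layout()
plt.show()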
The intervals make the benchmark more honest. A small apparent win may not matter if it sits inside the uncertainty band.
Estimator Selection Guidance
This table translates the benchmark into practical guidance. The right estimator depends on whether the goal is interpretability, nonlinear recovery, targeting, runtime, or robustness.
selection_guidance = pd.DataFrame(
    [
        {
            "use_case": "Need a transparent baseline",
            "estimator_to_try_first": "LinearDML",
            "what_to_check": "ATE bias, coefficient signs, and whether nonlinear estimators improve CATE recovery.",
        },
        {
            "use_case": "Expect nonlinear heterogeneity",
            "estimator_to_try_first": "CausalForestDML or forest-based DRLearner",
            "what_to_check": "CATE RMSE, decile calibration, and segment behavior.",
        },
        {
            "use_case": "Prioritization or targeting",
            "estimator_to_try_first": "The estimator with best top-group true effect in simulation or strongest validation proxy in real data.",
            "what_to_check": "Top-decile value, calibration curves, and stability across resamples.",
        },
        {
            "use_case": "Fast benchmark baseline",
            "estimator_to_try_first": "SLearner or TLearner",
            "what_to_check": "Whether simple outcome-model contrasts are competitive before using heavier methods.",
        },
        {
            "use_case": "Imbalanced treatment assignment",
            "estimator_to_try_first": "XLearner or DRLearner",
            "what_to_check": "Overlap, propensity quality, and selected-group composition.",
        },
        {
            "use_case": "Final reporting",
            "estimator_to_try_first": "Do not rely on one estimator only",
            "what_to_check": "Agreement across families, segment diagnostics, uncertainty, and sensitivity to assumptions.",
        },
    ]
)
selection_guidance.to_csv(TABLE_DIR / "13_estimator_selection_guidance.csv", index=False)
display(selection_guidance)
| use_case | estimator_to_try_first | what_to_check |
|---|---|---|
| Need a transparent baseline | LinearDML | ATE bias, coefficient signs, and whether nonlinear estimators improve CATE recovery. |
| Expect nonlinear heterogeneity | CausalForestDML or forest-based DRLearner | CATE RMSE, decile calibration, and segment behavior. |
| Prioritization or targeting | The estimator with best top-group true effect in simulation or strongest validation proxy in real data. | Top-decile value, calibration curves, and stability across resamples. |
| Fast benchmark baseline | SLearner or TLearner | Whether simple outcome-model contrasts are competitive before using heavier methods. |
| Imbalanced treatment assignment | XLearner or DRLearner | Overlap, propensity quality, and selected-group composition. |
| Final reporting | Do not rely on one estimator only | Agreement across families, segment diagnostics, uncertainty, and sensitivity to assumptions. |
The guidance keeps the benchmark from becoming a mechanical leaderboard. Estimator choice should reflect the causal question and the decision workflow.
Final Benchmark Checklist
This checklist summarizes the steps needed for a fair estimator comparison in a real analysis.
benchmark_checklist = pd.DataFrame(
    [
        {
            "step": "Define the estimand",
            "why_it_matters": "Estimator comparison is meaningless if methods are answering different causal questions.",
        },
        {
            "step": "Use one split and one feature set",
            "why_it_matters": "Preprocessing differences can masquerade as estimator differences.",
        },
        {
            "step": "Check overlap and balance first",
            "why_it_matters": "Estimator sophistication cannot fix severe support problems.",
        },
        {
            "step": "Evaluate average and heterogeneous metrics",
            "why_it_matters": "ATE accuracy and CATE ranking are different objectives.",
        },
        {
            "step": "Inspect calibration and segments",
            "why_it_matters": "Overall metrics can hide subgroup failures.",
        },
        {
            "step": "Track runtime and complexity",
            "why_it_matters": "The best estimator must also be maintainable and rerunnable.",
        },
        {
            "step": "Avoid declaring universal winners",
            "why_it_matters": "The best choice depends on data support, assumptions, and the downstream decision.",
        },
    ]
)
benchmark_checklist.to_csv(TABLE_DIR / "13_benchmark_checklist.csv", index=False)
display(benchmark_checklist)
| step | why_it_matters |
|---|---|
| Define the estimand | Estimator comparison is meaningless if methods are answering different causal questions. |
| Use one split and one feature set | Preprocessing differences can masquerade as estimator differences. |
| Check overlap and balance first | Estimator sophistication cannot fix severe support problems. |
| Evaluate average and heterogeneous metrics | ATE accuracy and CATE ranking are different objectives. |
| Inspect calibration and segments | Overall metrics can hide subgroup failures. |
| Track runtime and complexity | The best estimator must also be maintainable and rerunnable. |
| Avoid declaring universal winners | The best choice depends on data support, assumptions, and the downstream decision. |
The checklist is the portable lesson from this notebook. Use benchmarks to learn estimator behavior, not to crown a universal champion.
Summary
This notebook compared six EconML estimators on the same nonlinear binary-treatment ground truth.
The main lessons are:
Estimator comparisons should use the same split, same features, and same metrics.
ATE bias, CATE RMSE, CATE correlation, targeting value, and runtime answer different questions.
Orthogonal estimators and meta-learners can excel under different conditions.
Decile calibration is useful when treatment-effect estimates will be used for ranking.
Segment diagnostics can reveal failures that aggregate metrics hide.
Bootstrap test-sample intervals help prevent overreading tiny leaderboard gaps.
The right estimator is the one that best supports the causal question and downstream decision.
The next tutorial combines many of these ingredients into an end-to-end case study.