from pathlib import Path
import os
import time
import warnings
NOTEBOOK_DIR = Path.cwd()
if NOTEBOOK_DIR.name != "econml":
NOTEBOOK_DIR = Path("notebooks/tutorials/econml").resolve()
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
for directory in [FIGURE_DIR, TABLE_DIR, REPORT_DIR, OUTPUT_DIR / "matplotlib_cache"]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("MPLCONFIGDIR", str((OUTPUT_DIR / "matplotlib_cache").resolve()))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")
warnings.filterwarnings("ignore", message=".*column-vector y was passed.*")
warnings.filterwarnings("ignore", message=".*lbfgs failed to converge.*")
warnings.filterwarnings("ignore", message=".*invalid value encountered.*")
warnings.filterwarnings("ignore", message=".*Co-variance matrix is underdetermined.*")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from econml.dml import CausalForestDML, LinearDML
from econml.dr import DRLearner
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.dpi"] = 120
pd.set_option("display.max_columns", 80)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")EconML Tutorial 14: End-To-End Case Study
This notebook is the capstone workflow for the EconML tutorial sequence. Earlier notebooks focused on individual estimators, diagnostics, or decision steps. Here we connect those pieces into one applied causal analysis: define the decision, inspect the observational data, check assumptions, fit nuisance models, estimate heterogeneous effects, translate effects into a targeting policy, and prepare results that are easy to explain.
The setting is synthetic but realistic: a digital product team is deciding which users should receive a proactive onboarding bundle. Because the data are simulated, we know the true individual treatment effects. That hidden truth is used only for teaching and validation; it is never used to fit the EconML estimators.
Learning Goals
By the end of this notebook, you should be able to:
- Convert an applied decision into treatment, outcome, controls, effect modifiers, and target estimand.
- Run data-quality, missingness, balance, and overlap checks before fitting CATE models.
- Prepare X and W matrices for EconML without leaking test information into training.
- Fit LinearDML, CausalForestDML, and DRLearner on the same design.
- Compare ATE bias, CATE recovery, segment behavior, targeting value, budget curves, and runtime.
- Turn CATE estimates into a practical policy recommendation with clear limitations.
Notebook Flow
The case study follows a reusable analysis pattern:
- Define the decision and causal roles.
- Generate a realistic observational dataset with known hidden truth.
- Inspect field definitions, missingness, raw gaps, imbalance, and overlap.
- Build train/test matrices with clean imputation.
- Fit nuisance models and EconML treatment-effect estimators.
- Evaluate held-out CATE recovery and segment stability.
- Translate estimates into cost-aware targeting policies.
- Save report-ready tables, figures, and a short written recommendation.
Setup
The first code cell imports libraries, configures plotting, suppresses noisy implementation warnings, and creates output folders. MPLCONFIGDIR is set before importing Matplotlib so notebook execution does not create cache warnings on machines where the default Matplotlib config directory is not writable.
The output directories are now ready. Every artifact created by this notebook uses a 14_ prefix so it can be separated from the earlier tutorial outputs.
Reproducibility Settings
This notebook uses random simulation, train-test splitting, tree-based nuisance models, and bootstrap resampling. A fixed seed makes the results stable across runs while still giving us a realistic observational dataset.
RANDOM_STATE = 20260430
rng = np.random.default_rng(RANDOM_STATE)
CASE_PREFIX = "14"
N_USERS = 5_500
TEST_SIZE = 0.30
TREATMENT_COST = 0.18
settings = pd.DataFrame(
{
"setting": ["random_state", "n_users", "test_size", "treatment_cost"],
"value": [RANDOM_STATE, N_USERS, TEST_SIZE, TREATMENT_COST],
"why_it_matters": [
"Makes simulation, model fitting, and bootstrap resampling reproducible.",
"Keeps the notebook fast while leaving enough data for heterogeneous-effect diagnostics.",
"Creates a held-out sample for estimator comparison and policy evaluation.",
"Converts gross CATE into net decision value for targeting.",
],
}
)
settings.to_csv(TABLE_DIR / f"{CASE_PREFIX}_analysis_settings.csv", index=False)
display(settings)
| | setting | value | why_it_matters |
|---|---|---|---|
| 0 | random_state | 20,260,430.0000 | Makes simulation, model fitting, and bootstrap... |
| 1 | n_users | 5,500.0000 | Keeps the notebook fast while leaving enough d... |
| 2 | test_size | 0.3000 | Creates a held-out sample for estimator compar... |
| 3 | treatment_cost | 0.1800 | Converts gross CATE into net decision value fo... |
The treatment cost is measured in the same units as the outcome. A user should be targeted only if the expected incremental value is larger than this cost. That distinction turns effect estimation into a decision problem.
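To make that concrete, here is a tiny illustrative sketch (not one of the saved notebook cells) of the decision rule used later: a user is worth treating only when the estimated effect minus the treatment cost is positive. The effect values below are made up purely for illustration.
example_cate = np.array([0.45, 0.10, 0.22, -0.05])  # hypothetical estimated effects for four users
example_net_value = example_cate - TREATMENT_COST    # incremental value net of the treatment cost
example_decision = example_net_value > 0             # treat only when net value is positive
display(pd.DataFrame({"estimated_cate": example_cate, "net_value": example_net_value, "treat": example_decision}))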
Product Decision And Causal Roles
The hypothetical decision is whether a user should receive a proactive onboarding bundle during their first week. The bundle includes extra guidance, recommended paths, and reminders. The outcome is engagement value over the next eight weeks.
The data are observational. Higher-need and higher-potential users are more likely to receive the bundle, so raw treated-control differences will mix treatment impact with selection bias.
role_map = pd.DataFrame(
[
("Treatment T", "onboarding_bundle", "Whether the user received the proactive onboarding bundle.", "T"),
("Outcome Y", "eight_week_value", "Continuous engagement value after the treatment window.", "Y"),
("Effect modifiers X", "need, friction, preference, and reliability fields", "Variables that may change how much the bundle helps.", "X"),
("Controls W", "tenure, market, source, and operational fields", "Observed confounders adjusted for in nuisance models.", "W"),
("Hidden teaching truth", "true_cate and true_propensity", "Known only because this is a simulation; never used for fitting.", "Evaluation only"),
],
columns=["role", "field", "meaning", "used_by_econml_as"],
)
role_map.to_csv(TABLE_DIR / f"{CASE_PREFIX}_causal_role_map.csv", index=False)
display(role_map)
| | role | field | meaning | used_by_econml_as |
|---|---|---|---|---|
| 0 | Treatment T | onboarding_bundle | Whether the user received the proactive onboar... | T |
| 1 | Outcome Y | eight_week_value | Continuous engagement value after the treatmen... | Y |
| 2 | Effect modifiers X | need, friction, preference, and reliability fi... | Variables that may change how much the bundle ... | X |
| 3 | Controls W | tenure, market, source, and operational fields | Observed confounders adjusted for in nuisance ... | W |
| 4 | Hidden teaching truth | true_cate and true_propensity | Known only because this is a simulation; never... | Evaluation only |
This role table is the first guardrail. EconML is an estimation library; it still depends on the analyst to define treatment, outcome, confounders, and effect modifiers before fitting anything.
Data-Generating Process
The helper functions below create three hidden pieces of the synthetic world:
- A true CATE function that determines how much each user benefits from onboarding.
- A targeted treatment assignment process that creates confounding.
- A baseline outcome function that determines engagement even without treatment.
The true CATE is nonlinear on purpose, so the case study can test whether flexible estimators are useful.
def make_correlated_normals(n_rows, n_features, correlation, rng):
covariance = np.full((n_features, n_features), correlation)
np.fill_diagonal(covariance, 1.0)
return rng.multivariate_normal(np.zeros(n_features), covariance, size=n_rows)
def true_cate_function(frame):
return (
0.26
+ 0.28 * frame["onboarding_gap"]
+ 0.20 * frame["content_diversity"]
- 0.22 * frame["support_contacts"]
- 0.16 * frame["price_sensitivity"]
+ 0.18 * frame["high_motivation_segment"]
+ 0.15 * np.maximum(frame["search_difficulty"], 0)
+ 0.10 * frame["device_reliability"] * frame["content_diversity"]
- 0.12 * frame["region_competition"]
)
def treatment_propensity_function(frame):
logit = (
-0.20
+ 0.55 * frame["onboarding_gap"]
+ 0.42 * frame["prior_sessions"]
+ 0.35 * frame["high_motivation_segment"]
+ 0.30 * frame["content_diversity"]
+ 0.24 * frame["marketing_pressure"]
- 0.28 * frame["device_reliability"]
+ 0.18 * frame["support_contacts"]
- 0.16 * frame["referral_organic"]
)
return np.clip(expit(logit), 0.05, 0.95)
def baseline_outcome_function(frame):
return (
1.40
+ 0.45 * frame["prior_sessions"]
+ 0.35 * frame["content_diversity"]
- 0.32 * frame["support_contacts"]
- 0.28 * frame["search_difficulty"]
- 0.22 * frame["price_sensitivity"]
+ 0.24 * frame["device_reliability"]
+ 0.18 * frame["market_maturity"]
+ 0.12 * frame["referral_organic"]
+ 0.08 * frame["weekday_signup"]
+ 0.10 * frame["account_complexity"]
+ 0.05 * np.sin(frame["tenure_weeks"] / 3.0)
)
The CATE function says that onboarding helps users more when they have clear onboarding gaps, diverse interests, and search friction. It helps less when support burden and price sensitivity are high. That gives us heterogeneous effects with an intuitive story.
Generate The User-Level Table
The next cell creates the observed dataset and inserts modest missingness into a few pre-treatment fields. Missingness makes the preparation step more realistic without turning the tutorial into a data-cleaning exercise.
latent = make_correlated_normals(N_USERS, 6, correlation=0.28, rng=rng)
users = pd.DataFrame(
{
"user_id": np.arange(1, N_USERS + 1),
"onboarding_gap": latent[:, 0],
"prior_sessions": latent[:, 1],
"content_diversity": latent[:, 2],
"search_difficulty": latent[:, 3],
"price_sensitivity": latent[:, 4],
"device_reliability": latent[:, 5],
}
)
users["support_contacts"] = np.clip(rng.poisson(np.exp(0.10 + 0.25 * users["search_difficulty"]), N_USERS), 0, 6)
users["tenure_weeks"] = np.clip(rng.gamma(shape=3.0, scale=3.0, size=N_USERS), 1, 40)
users["region_competition"] = rng.normal(0, 1, N_USERS)
users["market_maturity"] = rng.normal(0, 1, N_USERS)
users["marketing_pressure"] = rng.normal(0, 1, N_USERS)
users["account_complexity"] = rng.normal(0, 1, N_USERS)
users["referral_organic"] = rng.binomial(1, expit(0.15 + 0.25 * users["content_diversity"]), N_USERS)
users["weekday_signup"] = rng.binomial(1, 0.72, N_USERS)
users["mobile_primary"] = rng.binomial(1, expit(0.05 - 0.20 * users["device_reliability"]), N_USERS)
users["high_motivation_segment"] = rng.binomial(
1,
expit(0.35 * users["prior_sessions"] + 0.25 * users["content_diversity"] - 0.10 * users["price_sensitivity"]),
N_USERS,
)
users["true_cate"] = true_cate_function(users)
users["true_propensity"] = treatment_propensity_function(users)
users["onboarding_bundle"] = rng.binomial(1, users["true_propensity"], N_USERS)
users["baseline_value"] = baseline_outcome_function(users)
users["eight_week_value"] = users["baseline_value"] + users["onboarding_bundle"] * users["true_cate"] + rng.normal(0, 0.75, N_USERS)
for column, missing_rate in {
"content_diversity": 0.035,
"search_difficulty": 0.025,
"support_contacts": 0.020,
"device_reliability": 0.030,
}.items():
users.loc[rng.random(N_USERS) < missing_rate, column] = np.nan
users.to_csv(TABLE_DIR / f"{CASE_PREFIX}_case_study_teaching_data.csv", index=False)
display(users.head())
| | user_id | onboarding_gap | prior_sessions | content_diversity | search_difficulty | price_sensitivity | device_reliability | support_contacts | tenure_weeks | region_competition | market_maturity | marketing_pressure | account_complexity | referral_organic | weekday_signup | mobile_primary | high_motivation_segment | true_cate | true_propensity | onboarding_bundle | baseline_value | eight_week_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.7118 | 0.0075 | 1.2556 | 0.0693 | -0.3601 | 0.6739 | 0.0000 | 10.3103 | -1.2263 | -1.1327 | 0.4568 | -0.3511 | 0 | 0 | 0 | 1 | 1.1902 | 0.6989 | 1 | 1.8108 | 2.7246 |
| 1 | 2 | -0.0587 | -0.3216 | -0.1668 | 1.6010 | -0.8754 | -0.8493 | 2.0000 | 13.5698 | 0.6656 | -0.1502 | -0.3617 | -0.2442 | 0 | 0 | 0 | 1 | 0.2647 | 0.6091 | 0 | -0.0032 | 0.6347 |
| 2 | 3 | 1.8492 | 0.5357 | 0.2346 | 1.0726 | 0.4464 | 0.9111 | 1.0000 | 4.8698 | 0.4456 | 0.0216 | 0.8945 | -0.5806 | 0 | 0 | 1 | 0 | 0.6621 | 0.7776 | 1 | 1.2190 | 1.8847 |
| 3 | 4 | 0.0469 | 0.0257 | 0.6607 | 2.2414 | -0.5130 | -0.9861 | 1.0000 | 4.3458 | -1.5559 | -0.7411 | -0.9080 | -0.7013 | 1 | 1 | 1 | 1 | 0.9051 | 0.6137 | 0 | 0.6175 | 0.0734 |
| 4 | 5 | 0.2438 | 1.3964 | -1.3551 | 0.1775 | -0.9877 | -1.0327 | 1.0000 | 17.7193 | 0.8292 | 0.0023 | 2.4848 | -0.0242 | 0 | 1 | 0 | 0 | 0.0623 | 0.7649 | 1 | 1.2134 | 0.9457 |
The dataset now contains the observed fields an analyst would use and the hidden truth fields used for teaching checks. In a real analysis, only observed pre-treatment fields, treatment, and outcome would be available.
Field Dictionary
A field dictionary makes the data roles auditable. It also helps avoid using post-treatment or hidden teaching fields accidentally in the model matrix.
field_dictionary = pd.DataFrame(
[
("user_id", "Identifier", "Row identifier used only for tracking.", "Observed", "Index only"),
("onboarding_gap", "Numeric", "How much help the user appears to need during onboarding.", "Observed", "X"),
("prior_sessions", "Numeric", "Early usage intensity before the bundle decision.", "Observed", "X"),
("content_diversity", "Numeric", "Breadth of content interests visible before treatment.", "Observed", "X"),
("search_difficulty", "Numeric", "Signal that the user struggles to find useful content.", "Observed", "X"),
("support_contacts", "Numeric", "Number of early support interactions before treatment.", "Observed", "X"),
("price_sensitivity", "Numeric", "Proxy for whether cost concerns limit engagement.", "Observed", "X"),
("device_reliability", "Numeric", "Proxy for whether technical quality enables benefit.", "Observed", "X"),
("high_motivation_segment", "Binary", "Pre-treatment segment flag from early intent signals.", "Observed", "X"),
("tenure_weeks", "Numeric", "Account age at decision time.", "Observed", "W"),
("region_competition", "Numeric", "Local competitive intensity proxy.", "Observed", "W"),
("market_maturity", "Numeric", "Market environment quality proxy.", "Observed", "W"),
("marketing_pressure", "Numeric", "Recent non-bundle marketing pressure.", "Observed", "W"),
("account_complexity", "Numeric", "Operational complexity proxy.", "Observed", "W"),
("referral_organic", "Binary", "Whether acquisition source was organic.", "Observed", "W"),
("weekday_signup", "Binary", "Whether signup happened on a weekday.", "Observed", "W"),
("mobile_primary", "Binary", "Whether early activity was mostly mobile.", "Observed", "W"),
("onboarding_bundle", "Binary", "Treatment assignment.", "Observed", "T"),
("eight_week_value", "Numeric", "Post-treatment outcome.", "Observed", "Y"),
("true_cate", "Numeric", "Known individual treatment effect for teaching evaluation.", "Hidden in real work", "Evaluation only"),
("true_propensity", "Numeric", "Known treatment probability for teaching evaluation.", "Hidden in real work", "Evaluation only"),
("baseline_value", "Numeric", "Known untreated outcome component.", "Hidden in real work", "Evaluation only"),
],
columns=["field", "type", "description", "availability", "analysis_role"],
)
field_dictionary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_field_dictionary.csv", index=False)
display(field_dictionary)
| | field | type | description | availability | analysis_role |
|---|---|---|---|---|---|
| 0 | user_id | Identifier | Row identifier used only for tracking. | Observed | Index only |
| 1 | onboarding_gap | Numeric | How much help the user appears to need during ... | Observed | X |
| 2 | prior_sessions | Numeric | Early usage intensity before the bundle decision. | Observed | X |
| 3 | content_diversity | Numeric | Breadth of content interests visible before tr... | Observed | X |
| 4 | search_difficulty | Numeric | Signal that the user struggles to find useful ... | Observed | X |
| 5 | support_contacts | Numeric | Number of early support interactions before tr... | Observed | X |
| 6 | price_sensitivity | Numeric | Proxy for whether cost concerns limit engagement. | Observed | X |
| 7 | device_reliability | Numeric | Proxy for whether technical quality enables be... | Observed | X |
| 8 | high_motivation_segment | Binary | Pre-treatment segment flag from early intent s... | Observed | X |
| 9 | tenure_weeks | Numeric | Account age at decision time. | Observed | W |
| 10 | region_competition | Numeric | Local competitive intensity proxy. | Observed | W |
| 11 | market_maturity | Numeric | Market environment quality proxy. | Observed | W |
| 12 | marketing_pressure | Numeric | Recent non-bundle marketing pressure. | Observed | W |
| 13 | account_complexity | Numeric | Operational complexity proxy. | Observed | W |
| 14 | referral_organic | Binary | Whether acquisition source was organic. | Observed | W |
| 15 | weekday_signup | Binary | Whether signup happened on a weekday. | Observed | W |
| 16 | mobile_primary | Binary | Whether early activity was mostly mobile. | Observed | W |
| 17 | onboarding_bundle | Binary | Treatment assignment. | Observed | T |
| 18 | eight_week_value | Numeric | Post-treatment outcome. | Observed | Y |
| 19 | true_cate | Numeric | Known individual treatment effect for teaching... | Hidden in real work | Evaluation only |
| 20 | true_propensity | Numeric | Known treatment probability for teaching evalu... | Hidden in real work | Evaluation only |
| 21 | baseline_value | Numeric | Known untreated outcome component. | Hidden in real work | Evaluation only |
The hidden teaching fields are clearly labeled as evaluation-only. This habit matters because accidental leakage can make causal models look much better than they are.
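One way to turn that habit into a check is a small guard cell like the sketch below (an optional addition, not one of the saved outputs). It reads the field dictionary and confirms that the evaluation-only teaching fields never overlap with the fields marked for modeling roles.
# Optional leakage guard based on the field dictionary above.
hidden_fields = set(field_dictionary.loc[field_dictionary["analysis_role"] == "Evaluation only", "field"])
modeling_fields = set(field_dictionary.loc[field_dictionary["analysis_role"].isin(["X", "W", "T", "Y"]), "field"])
assert hidden_fields.isdisjoint(modeling_fields), "Evaluation-only fields leaked into the modeling roles."
print(f"Checked {len(modeling_fields)} modeling fields; {len(hidden_fields)} evaluation-only fields excluded.")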
Basic Shape And Missingness
The first data check confirms the number of rows, treatment/control counts, and missing-value rates. Treatment and outcome must be complete for the simple workflow used here.
basic_shape = pd.DataFrame(
{
"metric": ["rows", "columns", "treated_rows", "control_rows", "outcome_missing", "treatment_missing"],
"value": [
len(users),
users.shape[1],
int(users["onboarding_bundle"].sum()),
int((1 - users["onboarding_bundle"]).sum()),
int(users["eight_week_value"].isna().sum()),
int(users["onboarding_bundle"].isna().sum()),
],
}
)
missing_summary = (
users.isna().mean().rename("missing_rate").reset_index().rename(columns={"index": "field"})
.query("missing_rate > 0")
.sort_values("missing_rate", ascending=False)
)
basic_shape.to_csv(TABLE_DIR / f"{CASE_PREFIX}_basic_shape.csv", index=False)
missing_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_missing_summary.csv", index=False)
display(basic_shape)
display(missing_summary)
| | metric | value |
|---|---|---|
| 0 | rows | 5500 |
| 1 | columns | 22 |
| 2 | treated_rows | 2890 |
| 3 | control_rows | 2610 |
| 4 | outcome_missing | 0 |
| 5 | treatment_missing | 0 |
| | field | missing_rate |
|---|---|---|
| 3 | content_diversity | 0.0322 |
| 6 | device_reliability | 0.0278 |
| 4 | search_difficulty | 0.0240 |
| 7 | support_contacts | 0.0215 |
Treatment and outcome are complete. A few pre-treatment fields have light missingness, so the modeling section will impute them using train-only medians and add missingness flags.
Raw Treated-Control Difference
A raw outcome comparison is a useful descriptive baseline, but it is not a causal estimate when treatment assignment is targeted. This cell compares the observed raw gap to the true average effect known from the simulation.
raw_group_summary = (
users.groupby("onboarding_bundle")
.agg(
users=("user_id", "size"),
observed_mean_outcome=("eight_week_value", "mean"),
true_mean_cate=("true_cate", "mean"),
mean_true_propensity=("true_propensity", "mean"),
)
.reset_index()
)
raw_difference = raw_group_summary.loc[raw_group_summary["onboarding_bundle"] == 1, "observed_mean_outcome"].iloc[0] - raw_group_summary.loc[raw_group_summary["onboarding_bundle"] == 0, "observed_mean_outcome"].iloc[0]
true_population_ate = users["true_cate"].mean()
raw_vs_truth = pd.DataFrame(
{
"quantity": ["raw_treated_minus_control", "true_population_ate", "raw_minus_true_ate"],
"value": [raw_difference, true_population_ate, raw_difference - true_population_ate],
}
)
raw_group_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_raw_group_summary.csv", index=False)
raw_vs_truth.to_csv(TABLE_DIR / f"{CASE_PREFIX}_raw_difference_vs_truth.csv", index=False)
display(raw_group_summary)
display(raw_vs_truth)
| | onboarding_bundle | users | observed_mean_outcome | true_mean_cate | mean_true_propensity |
|---|---|---|---|---|---|
| 0 | 0 | 2610 | 1.0686 | 0.0881 | 0.4417 |
| 1 | 1 | 2890 | 1.5806 | 0.2930 | 0.5946 |
| | quantity | value |
|---|---|---|
| 0 | raw_treated_minus_control | 0.5120 |
| 1 | true_population_ate | 0.1958 |
| 2 | raw_minus_true_ate | 0.3163 |
The raw gap is not the same as the true average effect. The treated users also have a different average assignment propensity, which is direct evidence of selection into treatment.
True Effect Distribution For Teaching
Because the data are simulated, we can inspect the true treatment-effect distribution. This is not available in real work, but it helps explain why heterogeneous-effect modeling is useful.
true_effect_summary = users["true_cate"].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).to_frame("true_cate")
true_effect_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_true_effect_summary.csv")
fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(users["true_cate"], bins=45, kde=True, color="#2563eb", ax=ax)
ax.axvline(TREATMENT_COST, color="#dc2626", linestyle="--", linewidth=1.6, label=f"Treatment cost = {TREATMENT_COST:.2f}")
ax.axvline(users["true_cate"].mean(), color="#111827", linewidth=1.3, label="True mean CATE")
ax.set_title("True Heterogeneous Treatment Effects")
ax.set_xlabel("True CATE")
ax.set_ylabel("Users")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_true_cate_distribution.png", dpi=160, bbox_inches="tight")
plt.show()
display(true_effect_summary)
| | true_cate |
|---|---|
| count | 5,500.0000 |
| mean | 0.1958 |
| std | 0.4820 |
| min | -1.5732 |
| 5% | -0.5931 |
| 25% | -0.1223 |
| 50% | 0.1933 |
| 75% | 0.5051 |
| 95% | 1.0103 |
| max | 2.0447 |
The distribution has meaningful spread around the average. Some users are much more attractive to target than others once cost is considered.
Covariate Balance Before Adjustment
Standardized mean differences compare treated and control groups on pre-treatment variables. Large absolute values indicate that treatment assignment is related to user characteristics.
OBSERVED_FEATURES = [
"onboarding_gap", "prior_sessions", "content_diversity", "search_difficulty", "support_contacts",
"price_sensitivity", "device_reliability", "high_motivation_segment", "tenure_weeks",
"region_competition", "market_maturity", "marketing_pressure", "account_complexity",
"referral_organic", "weekday_signup", "mobile_primary",
]
balance_rows = []
for feature in OBSERVED_FEATURES:
treated = users.loc[users["onboarding_bundle"] == 1, feature]
control = users.loc[users["onboarding_bundle"] == 0, feature]
pooled_sd = np.sqrt((treated.var(skipna=True) + control.var(skipna=True)) / 2)
smd = (treated.mean(skipna=True) - control.mean(skipna=True)) / pooled_sd if pooled_sd > 0 else 0.0
balance_rows.append(
{
"feature": feature,
"treated_mean": treated.mean(skipna=True),
"control_mean": control.mean(skipna=True),
"standardized_mean_difference": smd,
"abs_smd": abs(smd),
}
)
balance_table = pd.DataFrame(balance_rows).sort_values("abs_smd", ascending=False)
balance_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_covariate_balance.csv", index=False)
display(balance_table.head(12))
| | feature | treated_mean | control_mean | standardized_mean_difference | abs_smd |
|---|---|---|---|---|---|
| 0 | onboarding_gap | 0.2406 | -0.2843 | 0.5503 | 0.5503 |
| 1 | prior_sessions | 0.2548 | -0.2455 | 0.5121 | 0.5121 |
| 2 | content_diversity | 0.2235 | -0.2174 | 0.4537 | 0.4537 |
| 3 | search_difficulty | 0.1248 | -0.1917 | 0.3176 | 0.3176 |
| 7 | high_motivation_segment | 0.5651 | 0.4406 | 0.2508 | 0.2508 |
| 5 | price_sensitivity | 0.1005 | -0.1055 | 0.2071 | 0.2071 |
| 11 | marketing_pressure | 0.1125 | -0.0906 | 0.2062 | 0.2062 |
| 4 | support_contacts | 1.2205 | 1.0035 | 0.2036 | 0.2036 |
| 6 | device_reliability | 0.0726 | -0.0835 | 0.1549 | 0.1549 |
| 14 | weekday_signup | 0.7087 | 0.7356 | -0.0603 | 0.0603 |
| 12 | account_complexity | -0.0402 | 0.0069 | -0.0469 | 0.0469 |
| 9 | region_competition | -0.0289 | 0.0049 | -0.0343 | 0.0343 |
The largest imbalances identify why the raw comparison is risky. Several variables that influence treatment also influence outcomes.
Balance Plot
The balance plot makes imbalance easier to scan. Dashed reference lines at +0.10 and -0.10 mark a common practical threshold for variables worth investigating.
fig, ax = plt.subplots(figsize=(9, 6))
plot_balance = balance_table.sort_values("standardized_mean_difference")
sns.barplot(data=plot_balance, x="standardized_mean_difference", y="feature", color="#60a5fa", ax=ax)
ax.axvline(0, color="#111827", linewidth=1)
ax.axvline(0.10, color="#dc2626", linestyle="--", linewidth=1)
ax.axvline(-0.10, color="#dc2626", linestyle="--", linewidth=1)
ax.set_title("Pre-Treatment Covariate Imbalance")
ax.set_xlabel("Standardized Mean Difference")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_covariate_balance.png", dpi=160, bbox_inches="tight")
plt.show()
The imbalance pattern is exactly what motivates causal adjustment. The assignment rule has targeted users with different pre-treatment profiles.
Propensity Overlap Diagnostic
Overlap means similar users can be observed both treated and untreated. If estimated propensity scores are extremely close to zero or one, CATE estimates become more dependent on extrapolation.
propensity_frame = users[OBSERVED_FEATURES].fillna(users[OBSERVED_FEATURES].median())
diagnostic_propensity_model = LogisticRegression(max_iter=1_000, solver="lbfgs")
diagnostic_propensity_model.fit(propensity_frame, users["onboarding_bundle"])
users["diagnostic_propensity"] = diagnostic_propensity_model.predict_proba(propensity_frame)[:, 1]
propensity_summary = users["diagnostic_propensity"].describe(percentiles=[0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99]).to_frame("diagnostic_propensity")
propensity_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_diagnostic_propensity_summary.csv")
fig, ax = plt.subplots(figsize=(9, 5))
sns.kdeplot(data=users, x="diagnostic_propensity", hue="onboarding_bundle", common_norm=False, fill=True, alpha=0.25, ax=ax)
ax.axvline(0.05, color="#dc2626", linestyle="--", linewidth=1)
ax.axvline(0.95, color="#dc2626", linestyle="--", linewidth=1)
ax.set_title("Estimated Propensity Overlap")
ax.set_xlabel("Diagnostic Propensity Score")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_diagnostic_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()
display(propensity_summary)
| | diagnostic_propensity |
|---|---|
| count | 5,500.0000 |
| mean | 0.5255 |
| std | 0.1915 |
| min | 0.0563 |
| 1% | 0.1259 |
| 5% | 0.2071 |
| 10% | 0.2619 |
| 50% | 0.5312 |
| 90% | 0.7811 |
| 95% | 0.8259 |
| 99% | 0.9010 |
| max | 0.9715 |
Overlap is usable but not perfect. That means the analysis can proceed, while later recommendations should still be cautious around extreme-propensity users.
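As a quick supplementary check (an illustrative sketch, not one of the saved tables), the share of users outside the 0.05-0.95 band drawn in the plot above can be counted directly:
outside_band = (users["diagnostic_propensity"] < 0.05) | (users["diagnostic_propensity"] > 0.95)
print(
    f"Users outside the 0.05-0.95 diagnostic propensity band: "
    f"{int(outside_band.sum())} ({outside_band.mean():.2%} of the sample)"
)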
Train-Test Split
The estimators are fit on the training set and evaluated on a held-out set. Stratifying by treatment keeps the treated/control share similar in both samples.
train_df, test_df = train_test_split(
users,
test_size=TEST_SIZE,
random_state=RANDOM_STATE,
stratify=users["onboarding_bundle"],
)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
split_summary = pd.DataFrame(
{
"split": ["train", "test"],
"rows": [len(train_df), len(test_df)],
"treatment_rate": [train_df["onboarding_bundle"].mean(), test_df["onboarding_bundle"].mean()],
"mean_outcome": [train_df["eight_week_value"].mean(), test_df["eight_week_value"].mean()],
"mean_true_cate": [train_df["true_cate"].mean(), test_df["true_cate"].mean()],
}
)
split_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_train_test_split_summary.csv", index=False)
display(split_summary)
| | split | rows | treatment_rate | mean_outcome | mean_true_cate |
|---|---|---|---|---|---|
| 0 | train | 3850 | 0.5255 | 1.3454 | 0.2020 |
| 1 | test | 1650 | 0.5255 | 1.3194 | 0.1812 |
The splits are comparable in size, treatment rate, and average true effect. That creates a stable evaluation setup for the rest of the notebook.
Feature Roles For EconML
X contains effect modifiers: variables we want the CATE function to vary with and variables a targeting rule may use. W contains additional controls used for adjustment in nuisance models.
X_COLUMNS = [
"onboarding_gap", "prior_sessions", "content_diversity", "search_difficulty",
"support_contacts", "price_sensitivity", "device_reliability", "high_motivation_segment",
]
W_COLUMNS = [
"tenure_weeks", "region_competition", "market_maturity", "marketing_pressure",
"account_complexity", "referral_organic", "weekday_signup", "mobile_primary",
]
feature_role_table = pd.DataFrame(
{
"field": X_COLUMNS + W_COLUMNS,
"econml_role": ["X"] * len(X_COLUMNS) + ["W"] * len(W_COLUMNS),
}
)
feature_role_table["reason"] = [
"May directly change how much help onboarding provides.",
"Early activity can modify treatment value.",
"Breadth of interests can make guided discovery more useful.",
"Search difficulty is a clear need signal.",
"Support burden may limit incremental benefit.",
"Cost concerns can reduce realized impact.",
"Technical reliability can enable or block benefit.",
"Motivation segment is an actionable effect modifier.",
"Account age adjusts for maturity at decision time.",
"Market competition affects both assignment and outcome.",
"Market maturity adjusts for environment quality.",
"Other marketing pressure can confound outcomes.",
"Operational complexity adjusts for baseline value.",
"Acquisition source can confound early engagement.",
"Signup day controls timing patterns.",
"Primary device controls experience differences.",
]
feature_role_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_feature_role_table.csv", index=False)
display(feature_role_table)
| | field | econml_role | reason |
|---|---|---|---|
| 0 | onboarding_gap | X | May directly change how much help onboarding p... |
| 1 | prior_sessions | X | Early activity can modify treatment value. |
| 2 | content_diversity | X | Breadth of interests can make guided discovery... |
| 3 | search_difficulty | X | Search difficulty is a clear need signal. |
| 4 | support_contacts | X | Support burden may limit incremental benefit. |
| 5 | price_sensitivity | X | Cost concerns can reduce realized impact. |
| 6 | device_reliability | X | Technical reliability can enable or block bene... |
| 7 | high_motivation_segment | X | Motivation segment is an actionable effect mod... |
| 8 | tenure_weeks | W | Account age adjusts for maturity at decision t... |
| 9 | region_competition | W | Market competition affects both assignment and... |
| 10 | market_maturity | W | Market maturity adjusts for environment quality. |
| 11 | marketing_pressure | W | Other marketing pressure can confound outcomes. |
| 12 | account_complexity | W | Operational complexity adjusts for baseline va... |
| 13 | referral_organic | W | Acquisition source can confound early engagement. |
| 14 | weekday_signup | W | Signup day controls timing patterns. |
| 15 | mobile_primary | W | Primary device controls experience differences. |
This split keeps the CATE surface focused on user-need and product-experience fields. The controls still adjust for confounding, but they are not the main segmentation story.
Train-Only Imputation And Matrix Construction
Missing values are imputed using training medians only. Missingness flags are placed in W, allowing nuisance models to learn whether missingness itself is predictive without making the CATE surface harder to explain.
ALL_MODEL_COLUMNS = X_COLUMNS + W_COLUMNS
train_medians = train_df[ALL_MODEL_COLUMNS].median(numeric_only=True)
MISSING_FLAG_COLUMNS = [column for column in ALL_MODEL_COLUMNS if train_df[column].isna().any() or test_df[column].isna().any()]
def build_design_matrices(frame, medians):
raw_features = frame[ALL_MODEL_COLUMNS].copy()
missing_flags = raw_features[MISSING_FLAG_COLUMNS].isna().astype(float).add_suffix("_missing")
filled_features = raw_features.fillna(medians)
X = filled_features[X_COLUMNS].astype(float)
W = pd.concat([filled_features[W_COLUMNS].astype(float), missing_flags], axis=1)
return X, W
X_train, W_train = build_design_matrices(train_df, train_medians)
X_test, W_test = build_design_matrices(test_df, train_medians)
T_train = train_df["onboarding_bundle"].astype(int)
T_test = test_df["onboarding_bundle"].astype(int)
Y_train = train_df["eight_week_value"].astype(float)
Y_test = test_df["eight_week_value"].astype(float)
true_tau_test = test_df["true_cate"].to_numpy()
matrix_summary = pd.DataFrame(
{
"matrix": ["X_train", "W_train", "X_test", "W_test"],
"rows": [X_train.shape[0], W_train.shape[0], X_test.shape[0], W_test.shape[0]],
"columns": [X_train.shape[1], W_train.shape[1], X_test.shape[1], W_test.shape[1]],
"missing_values_after_preparation": [
int(X_train.isna().sum().sum()), int(W_train.isna().sum().sum()),
int(X_test.isna().sum().sum()), int(W_test.isna().sum().sum()),
],
}
)
matrix_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_model_matrix_summary.csv", index=False)
display(matrix_summary)
| | matrix | rows | columns | missing_values_after_preparation |
|---|---|---|---|---|
| 0 | X_train | 3850 | 8 | 0 |
| 1 | W_train | 3850 | 12 | 0 |
| 2 | X_test | 1650 | 8 | 0 |
| 3 | W_test | 1650 | 12 | 0 |
The prepared matrices contain no missing values. The key practice is that test rows did not influence the imputation values used for training.
Nuisance Model Diagnostics
Treatment-effect estimators rely on nuisance models for treatment assignment and outcomes. These diagnostics check whether observed pre-treatment features contain enough predictive signal to support adjustment.
def make_rf_classifier(seed_offset=0):
return RandomForestClassifier(
n_estimators=180,
min_samples_leaf=30,
max_features=0.75,
random_state=RANDOM_STATE + seed_offset,
n_jobs=-1,
)
def make_rf_regressor(seed_offset=0):
return RandomForestRegressor(
n_estimators=180,
min_samples_leaf=25,
max_features=0.75,
random_state=RANDOM_STATE + seed_offset,
n_jobs=-1,
)
XW_train = pd.concat([X_train, W_train], axis=1)
XW_test = pd.concat([X_test, W_test], axis=1)
propensity_model = make_rf_classifier(10)
propensity_model.fit(XW_train, T_train)
propensity_pred = np.clip(propensity_model.predict_proba(XW_test)[:, 1], 1e-4, 1 - 1e-4)
outcome_model_control = make_rf_regressor(20)
outcome_model_treated = make_rf_regressor(30)
outcome_model_control.fit(XW_train.loc[T_train == 0], Y_train.loc[T_train == 0])
outcome_model_treated.fit(XW_train.loc[T_train == 1], Y_train.loc[T_train == 1])
mu0_pred = outcome_model_control.predict(XW_test)
mu1_pred = outcome_model_treated.predict(XW_test)
outcome_pred_observed_arm = np.where(T_test.to_numpy() == 1, mu1_pred, mu0_pred)
nuisance_diagnostics = pd.DataFrame(
{
"diagnostic": [
"propensity_auc", "propensity_log_loss", "propensity_min", "propensity_max",
"outcome_observed_arm_r2", "outcome_observed_arm_mae",
],
"value": [
roc_auc_score(T_test, propensity_pred), log_loss(T_test, propensity_pred),
propensity_pred.min(), propensity_pred.max(),
r2_score(Y_test, outcome_pred_observed_arm), mean_absolute_error(Y_test, outcome_pred_observed_arm),
],
}
)
nuisance_diagnostics.to_csv(TABLE_DIR / f"{CASE_PREFIX}_nuisance_diagnostics.csv", index=False)
display(nuisance_diagnostics)
| | diagnostic | value |
|---|---|---|
| 0 | propensity_auc | 0.7218 |
| 1 | propensity_log_loss | 0.6137 |
| 2 | propensity_min | 0.1216 |
| 3 | propensity_max | 0.9082 |
| 4 | outcome_observed_arm_r2 | 0.5002 |
| 5 | outcome_observed_arm_mae | 0.6858 |
The treatment model has signal, confirming that treatment was targeted. The outcome model is also informative enough to proceed, though it remains an approximation.
Overlap Buckets On The Held-Out Sample
Bucketed propensities reveal where support is strong or weak. The later policy section will use these scores to create a conservative support-aware targeting rule.
overlap_frame = test_df[["user_id", "onboarding_bundle", "eight_week_value", "true_cate"]].copy()
overlap_frame["estimated_propensity"] = propensity_pred
overlap_frame["overlap_bucket"] = pd.cut(
overlap_frame["estimated_propensity"],
bins=[0.0, 0.10, 0.25, 0.50, 0.75, 0.90, 1.0],
include_lowest=True,
)
overlap_summary = (
overlap_frame.groupby("overlap_bucket", observed=True)
.agg(
users=("user_id", "size"),
treatment_rate=("onboarding_bundle", "mean"),
mean_true_cate=("true_cate", "mean"),
mean_outcome=("eight_week_value", "mean"),
)
.reset_index()
)
overlap_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_overlap_bucket_summary.csv", index=False)
display(overlap_summary)
| | overlap_bucket | users | treatment_rate | mean_true_cate | mean_outcome |
|---|---|---|---|---|---|
| 0 | (0.1, 0.25] | 98 | 0.2347 | -0.2653 | 0.3108 |
| 1 | (0.25, 0.5] | 647 | 0.3787 | -0.0381 | 0.9534 |
| 2 | (0.5, 0.75] | 724 | 0.6105 | 0.3050 | 1.5611 |
| 3 | (0.75, 0.9] | 179 | 0.8659 | 0.7032 | 2.1917 |
| 4 | (0.9, 1.0] | 2 | 1.0000 | 1.4372 | 3.5733 |
Most users sit away from the most extreme propensity regions. That is encouraging, but the edge buckets are still useful for cautious reporting.
Estimator Strategy
The case study uses three complementary EconML estimators. The goal is not to crown a universal winner; it is to test whether the recommendation is stable across reasonable modeling choices.
estimator_setup = pd.DataFrame(
[
("LinearDML", "Transparent structured heterogeneity with polynomial features.", "Can miss sharp nonlinearities.", "Readable baseline."),
("CausalForestDML", "Flexible nonlinear heterogeneity.", "Can be noisier and slower.", "Matches the nonlinear teaching setup."),
("DRLearner", "Doubly robust pseudo-outcome strategy.", "Can inherit variance from propensity estimates.", "Useful for robust targeting comparison."),
],
columns=["estimator", "main_strength", "main_risk", "why_used_here"],
)
estimator_setup.to_csv(TABLE_DIR / f"{CASE_PREFIX}_estimator_setup.csv", index=False)
display(estimator_setup)
| | estimator | main_strength | main_risk | why_used_here |
|---|---|---|---|---|
| 0 | LinearDML | Transparent structured heterogeneity with poly... | Can miss sharp nonlinearities. | Readable baseline. |
| 1 | CausalForestDML | Flexible nonlinear heterogeneity. | Can be noisier and slower. | Matches the nonlinear teaching setup. |
| 2 | DRLearner | Doubly robust pseudo-outcome strategy. | Can inherit variance from propensity estimates. | Useful for robust targeting comparison. |
The table names the modeling tradeoff before results are known. That keeps the comparison from becoming pure scoreboard chasing.
Fit EconML Estimators
This cell fits all estimators on the same training matrices and records runtime. The treatment is binary, so the DML estimators are configured for discrete treatment where needed.
estimators = {
"LinearDML": LinearDML(
model_y=make_rf_regressor(101),
model_t=make_rf_classifier(102),
discrete_treatment=True,
featurizer=PolynomialFeatures(degree=2, include_bias=False),
cv=3,
random_state=RANDOM_STATE,
),
"CausalForestDML": CausalForestDML(
model_y=make_rf_regressor(201),
model_t=make_rf_classifier(202),
discrete_treatment=True,
n_estimators=160,
min_samples_leaf=18,
max_samples=0.45,
cv=3,
random_state=RANDOM_STATE,
),
"DRLearner": DRLearner(
model_propensity=make_rf_classifier(301),
model_regression=make_rf_regressor(302),
model_final=make_rf_regressor(303),
min_propensity=0.02,
cv=3,
random_state=RANDOM_STATE,
),
}
fit_rows = []
for estimator_name, estimator in estimators.items():
start = time.perf_counter()
estimator.fit(Y_train, T_train, X=X_train, W=W_train)
fit_rows.append({"estimator": estimator_name, "fit_seconds": time.perf_counter() - start})
fit_runtime = pd.DataFrame(fit_rows).sort_values("fit_seconds")
fit_runtime.to_csv(TABLE_DIR / f"{CASE_PREFIX}_fit_runtime.csv", index=False)
display(fit_runtime)
| | estimator | fit_seconds |
|---|---|---|
| 0 | LinearDML | 3.4164 |
| 1 | CausalForestDML | 3.6490 |
| 2 | DRLearner | 3.9710 |
Runtime is part of practical model selection. A small accuracy improvement may not be worth much if it makes the workflow harder to refresh or explain.
Predict Held-Out CATEs
Each estimator now predicts a treatment effect for each held-out user. The true CATE is attached only for evaluation.
effect_predictions = test_df[["user_id", "onboarding_bundle", "eight_week_value", "true_cate", "diagnostic_propensity"]].copy()
prediction_rows = []
for estimator_name, estimator in estimators.items():
start = time.perf_counter()
effect_predictions[estimator_name] = estimator.effect(X_test)
prediction_rows.append({"estimator": estimator_name, "predict_seconds": time.perf_counter() - start})
prediction_runtime = pd.DataFrame(prediction_rows).sort_values("predict_seconds")
effect_predictions.to_csv(TABLE_DIR / f"{CASE_PREFIX}_heldout_effect_predictions.csv", index=False)
prediction_runtime.to_csv(TABLE_DIR / f"{CASE_PREFIX}_prediction_runtime.csv", index=False)
display(effect_predictions.head())
display(prediction_runtime)
| | user_id | onboarding_bundle | eight_week_value | true_cate | diagnostic_propensity | LinearDML | CausalForestDML | DRLearner |
|---|---|---|---|---|---|---|---|---|
| 0 | 5064 | 1 | 2.0690 | 0.1123 | 0.5733 | 0.1962 | 0.0582 | 0.2281 |
| 1 | 1117 | 1 | 1.0561 | 0.1787 | 0.6108 | 0.0079 | 0.1102 | 0.0641 |
| 2 | 3396 | 0 | 1.8315 | -0.2375 | 0.2277 | -0.0012 | 0.0042 | -0.1578 |
| 3 | 1511 | 1 | 1.4275 | 0.5047 | 0.7575 | 0.7616 | 0.3912 | 0.6287 |
| 4 | 5101 | 1 | 1.4083 | -0.4018 | 0.7290 | -0.1236 | 0.2198 | 0.3812 |
| | estimator | predict_seconds |
|---|---|---|
| 0 | LinearDML | 0.0030 |
| 1 | CausalForestDML | 0.0695 |
| 2 | DRLearner | 0.0827 |
The prediction table is the bridge from estimation to action. Once every held-out user has an estimated incremental effect, we can evaluate accuracy, ranking, segments, and policy value.
Held-Out Benchmark Metrics
The benchmark compares estimators using average-effect bias, individual-effect error, correlation with the true effect ranking, top-segment value, and treatment share above cost.
def safe_corr(left, right):
if np.std(left) == 0 or np.std(right) == 0:
return np.nan
return float(np.corrcoef(left, right)[0, 1])
benchmark_rows = []
true_ate_test = true_tau_test.mean()
for estimator_name in estimators:
estimated = effect_predictions[estimator_name].to_numpy()
top_mask = estimated >= np.quantile(estimated, 0.80)
benchmark_rows.append(
{
"estimator": estimator_name,
"true_ate_test": true_ate_test,
"estimated_ate": estimated.mean(),
"ate_bias": estimated.mean() - true_ate_test,
"abs_ate_bias": abs(estimated.mean() - true_ate_test),
"cate_rmse": mean_squared_error(true_tau_test, estimated) ** 0.5,
"cate_mae": mean_absolute_error(true_tau_test, estimated),
"cate_correlation": safe_corr(true_tau_test, estimated),
"top_20_true_cate_mean": true_tau_test[top_mask].mean(),
"top_20_estimated_cate_mean": estimated[top_mask].mean(),
"share_estimated_above_cost": np.mean(estimated > TREATMENT_COST),
}
)
benchmark_metrics = pd.DataFrame(benchmark_rows).merge(fit_runtime, on="estimator").merge(prediction_runtime, on="estimator")
benchmark_metrics = benchmark_metrics.sort_values("cate_rmse")
benchmark_metrics.to_csv(TABLE_DIR / f"{CASE_PREFIX}_benchmark_metrics.csv", index=False)
display(benchmark_metrics)
| | estimator | true_ate_test | estimated_ate | ate_bias | abs_ate_bias | cate_rmse | cate_mae | cate_correlation | top_20_true_cate_mean | top_20_estimated_cate_mean | share_estimated_above_cost | fit_seconds | predict_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LinearDML | 0.1812 | 0.2309 | 0.0497 | 0.0497 | 0.2719 | 0.2085 | 0.8948 | 0.7908 | 1.0499 | 0.5394 | 3.4164 | 0.0030 |
| 2 | DRLearner | 0.1812 | 0.2099 | 0.0287 | 0.0287 | 0.2720 | 0.2129 | 0.8458 | 0.7515 | 0.9236 | 0.5188 | 3.9710 | 0.0827 |
| 1 | CausalForestDML | 0.1812 | 0.2147 | 0.0336 | 0.0336 | 0.3168 | 0.2404 | 0.8173 | 0.7613 | 0.5643 | 0.5558 | 3.6490 | 0.0695 |
The best estimator depends on the use case. For targeting, ranking quality and top-segment value can matter more than a tiny difference in average-effect bias.
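If ranking quality is the priority, a rank correlation is a natural companion to the Pearson correlation in the benchmark table. The sketch below is an optional addition (it assumes scipy's spearmanr, which is already a dependency of this stack) that summarizes how well each estimator orders users by true effect.
from scipy.stats import spearmanr

rank_rows = []
for estimator_name in estimators:
    # Spearman correlation depends only on ordering, which is what targeting relies on.
    rho, _ = spearmanr(true_tau_test, effect_predictions[estimator_name].to_numpy())
    rank_rows.append({"estimator": estimator_name, "spearman_rank_correlation": rho})
display(pd.DataFrame(rank_rows).sort_values("spearman_rank_correlation", ascending=False))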
Accuracy And Bias Plot
The next figure compares CATE RMSE with absolute ATE bias. This catches models that estimate the average well but miss heterogeneous effects.
metric_plot = benchmark_metrics.melt(
id_vars="estimator",
value_vars=["cate_rmse", "abs_ate_bias"],
var_name="metric",
value_name="value",
)
metric_plot["metric"] = metric_plot["metric"].map({"cate_rmse": "CATE RMSE", "abs_ate_bias": "Absolute ATE Bias"})
fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=metric_plot, x="value", y="estimator", hue="metric", ax=ax)
ax.set_title("Held-Out Accuracy And Average-Effect Bias")
ax.set_xlabel("Error")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_accuracy_and_bias.png", dpi=160, bbox_inches="tight")
plt.show()
This plot makes the accuracy tradeoff visible. In real work, true CATE would be unavailable, so analogous validation would rely on experiments, temporal checks, or sensitivity analysis.
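One truth-free check worth knowing is the inverse-propensity-weighted transformed outcome: under unconfoundedness, Y*T/e(X) - Y*(1-T)/(1-e(X)) has the CATE as its conditional mean, so its average inside an estimated-effect bucket approximates that bucket's effect using only observed data. The sketch below is illustrative rather than part of the saved benchmark; it reuses the held-out diagnostic propensities and the LinearDML column purely as an example.
t_array = T_test.to_numpy()
y_array = Y_test.to_numpy()
# IPW transformed outcome: noisy per user, but unbiased for the group-level effect.
transformed_outcome = y_array * t_array / propensity_pred - y_array * (1 - t_array) / (1 - propensity_pred)
proxy_check = effect_predictions[["LinearDML"]].copy()
proxy_check["estimated_quintile"] = pd.qcut(proxy_check["LinearDML"], q=5, labels=False, duplicates="drop") + 1
proxy_check["transformed_outcome"] = transformed_outcome
display(proxy_check.groupby("estimated_quintile")["transformed_outcome"].mean().to_frame("ipw_effect_proxy"))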
CATE Recovery Scatterplots
Scatterplots show whether estimated effects recover the shape of the true effect surface. Points near the diagonal are closer to the known truth.
scatter_data = effect_predictions.melt(
id_vars=["user_id", "true_cate"],
value_vars=list(estimators.keys()),
var_name="estimator",
value_name="estimated_cate",
)
min_value = min(scatter_data["true_cate"].min(), scatter_data["estimated_cate"].min())
max_value = max(scatter_data["true_cate"].max(), scatter_data["estimated_cate"].max())
g = sns.FacetGrid(scatter_data, col="estimator", col_wrap=3, height=3.6, sharex=True, sharey=True)
g.map_dataframe(sns.scatterplot, x="true_cate", y="estimated_cate", alpha=0.28, s=18, color="#2563eb")
for ax in g.axes.flat:
ax.plot([min_value, max_value], [min_value, max_value], color="#dc2626", linestyle="--", linewidth=1)
ax.set_xlabel("True CATE")
ax.set_ylabel("Estimated CATE")
g.fig.suptitle("Held-Out CATE Recovery", y=1.02)
plt.tight_layout()
g.fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_cate_recovery_scatterplots.png", dpi=160, bbox_inches="tight")
plt.show()
The scatterplots show whether estimates are simply compressed toward the mean or whether they separate high- and low-benefit users.
Decile Calibration
Calibration asks whether users placed into higher estimated-effect buckets also have higher true effects. This is a practical targeting diagnostic.
calibration_parts = []
for estimator_name in estimators:
temp = effect_predictions[["user_id", "true_cate", estimator_name]].copy()
temp["estimated_cate"] = temp[estimator_name]
temp["estimated_decile"] = pd.qcut(temp["estimated_cate"], q=10, labels=False, duplicates="drop") + 1
summary = (
temp.groupby("estimated_decile", observed=True)
.agg(
users=("user_id", "size"),
mean_estimated_cate=("estimated_cate", "mean"),
mean_true_cate=("true_cate", "mean"),
)
.reset_index()
)
summary["estimator"] = estimator_name
calibration_parts.append(summary)
calibration_table = pd.concat(calibration_parts, ignore_index=True)
calibration_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_decile_calibration.csv", index=False)
display(calibration_table.head(15))
| | estimated_decile | users | mean_estimated_cate | mean_true_cate | estimator |
|---|---|---|---|---|---|
| 0 | 1 | 165 | -0.8145 | -0.5506 | LinearDML |
| 1 | 2 | 165 | -0.3783 | -0.2866 | LinearDML |
| 2 | 3 | 165 | -0.1381 | -0.1077 | LinearDML |
| 3 | 4 | 165 | 0.0292 | 0.0188 | LinearDML |
| 4 | 5 | 165 | 0.1672 | 0.1223 | LinearDML |
| 5 | 6 | 165 | 0.2903 | 0.2248 | LinearDML |
| 6 | 7 | 165 | 0.4427 | 0.3436 | LinearDML |
| 7 | 8 | 165 | 0.6106 | 0.4654 | LinearDML |
| 8 | 9 | 165 | 0.8049 | 0.6159 | LinearDML |
| 9 | 10 | 165 | 1.2949 | 0.9657 | LinearDML |
| 10 | 1 | 165 | -0.1988 | -0.4769 | CausalForestDML |
| 11 | 2 | 165 | -0.0617 | -0.2107 | CausalForestDML |
| 12 | 3 | 165 | 0.0309 | -0.0988 | CausalForestDML |
| 13 | 4 | 165 | 0.1072 | 0.0241 | CausalForestDML |
| 14 | 5 | 165 | 0.1828 | 0.1088 | CausalForestDML |
A useful targeting model should show rising true effects as estimated deciles increase. If that pattern fails, individual targeting is hard to justify.
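A compact way to summarize that requirement (an optional sketch, not one of the saved tables) is the share of adjacent decile steps where the true average effect actually increases:
monotonicity_check = (
    calibration_table.sort_values(["estimator", "estimated_decile"])
    .groupby("estimator")["mean_true_cate"]
    .apply(lambda series: (series.diff().dropna() > 0).mean())  # share of rising decile-to-decile steps
    .rename("share_of_increasing_steps")
    .reset_index()
)
display(monotonicity_check)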
Calibration Plot
The plot compares estimated and true average CATE by estimated-effect decile for each estimator.
calibration_plot = calibration_table.melt(
id_vars=["estimator", "estimated_decile", "users"],
value_vars=["mean_estimated_cate", "mean_true_cate"],
var_name="series",
value_name="cate",
)
calibration_plot["series"] = calibration_plot["series"].map({"mean_estimated_cate": "Estimated", "mean_true_cate": "True"})
g = sns.relplot(
data=calibration_plot,
x="estimated_decile",
y="cate",
hue="series",
col="estimator",
kind="line",
marker="o",
col_wrap=3,
height=3.4,
facet_kws={"sharey": True},
)
g.set_axis_labels("Estimated CATE Decile", "Average CATE")
g.fig.suptitle("CATE Decile Calibration", y=1.03)
plt.tight_layout()
g.fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_decile_calibration.png", dpi=160, bbox_inches="tight")
plt.show()
This view is decision-oriented. A model can be noisy at the individual level yet still useful if it ranks high-benefit users reliably.
Segment-Level Checks
Segment checks reveal whether estimator behavior is stable across meaningful groups. This catches cases where a model looks good overall but fails in a segment that matters.
segment_definitions = {
"high_onboarding_gap": test_df["onboarding_gap"] >= test_df["onboarding_gap"].median(),
"high_prior_sessions": test_df["prior_sessions"] >= test_df["prior_sessions"].median(),
"high_search_difficulty": test_df["search_difficulty"].fillna(train_medians["search_difficulty"]) >= test_df["search_difficulty"].fillna(train_medians["search_difficulty"]).median(),
"high_price_sensitivity": test_df["price_sensitivity"] >= test_df["price_sensitivity"].median(),
"high_motivation_segment": test_df["high_motivation_segment"] == 1,
"organic_referral": test_df["referral_organic"] == 1,
"mobile_primary": test_df["mobile_primary"] == 1,
}
segment_rows = []
for segment_name, mask in segment_definitions.items():
mask_array = mask.to_numpy()
for estimator_name in estimators:
estimated = effect_predictions[estimator_name].to_numpy()
segment_rows.append(
{
"segment": segment_name,
"estimator": estimator_name,
"users": int(mask_array.sum()),
"true_cate_mean": true_tau_test[mask_array].mean(),
"estimated_cate_mean": estimated[mask_array].mean(),
"segment_bias": estimated[mask_array].mean() - true_tau_test[mask_array].mean(),
"share_estimated_above_cost": np.mean(estimated[mask_array] > TREATMENT_COST),
}
)
segment_benchmark = pd.DataFrame(segment_rows)
segment_benchmark.to_csv(TABLE_DIR / f"{CASE_PREFIX}_segment_benchmark.csv", index=False)
display(segment_benchmark.head(15))
| | segment | estimator | users | true_cate_mean | estimated_cate_mean | segment_bias | share_estimated_above_cost |
|---|---|---|---|---|---|---|---|
| 0 | high_onboarding_gap | LinearDML | 825 | 0.4101 | 0.5078 | 0.0977 | 0.7527 |
| 1 | high_onboarding_gap | CausalForestDML | 825 | 0.4101 | 0.3854 | -0.0247 | 0.8800 |
| 2 | high_onboarding_gap | DRLearner | 825 | 0.4101 | 0.4808 | 0.0707 | 0.7564 |
| 3 | high_prior_sessions | LinearDML | 825 | 0.2634 | 0.2984 | 0.0350 | 0.5927 |
| 4 | high_prior_sessions | CausalForestDML | 825 | 0.2634 | 0.2665 | 0.0031 | 0.6461 |
| 5 | high_prior_sessions | DRLearner | 825 | 0.2634 | 0.2666 | 0.0031 | 0.5697 |
| 6 | high_search_difficulty | LinearDML | 825 | 0.2798 | 0.3400 | 0.0602 | 0.6085 |
| 7 | high_search_difficulty | CausalForestDML | 825 | 0.2798 | 0.2913 | 0.0115 | 0.6812 |
| 8 | high_search_difficulty | DRLearner | 825 | 0.2798 | 0.3170 | 0.0372 | 0.6048 |
| 9 | high_price_sensitivity | LinearDML | 825 | 0.1520 | 0.1980 | 0.0460 | 0.5139 |
| 10 | high_price_sensitivity | CausalForestDML | 825 | 0.1520 | 0.2295 | 0.0775 | 0.5758 |
| 11 | high_price_sensitivity | DRLearner | 825 | 0.1520 | 0.2024 | 0.0504 | 0.4994 |
| 12 | high_motivation_segment | LinearDML | 852 | 0.3199 | 0.4238 | 0.1039 | 0.6878 |
| 13 | high_motivation_segment | CausalForestDML | 852 | 0.3199 | 0.2690 | -0.0509 | 0.6455 |
| 14 | high_motivation_segment | DRLearner | 852 | 0.3199 | 0.3154 | -0.0045 | 0.6092 |
Segment summaries show where estimated effects are systematically high or low. In real analysis, this would be a stability check rather than a truth-scoring table.
Segment Bias Heatmap
The heatmap makes segment-level average bias easy to scan across estimators.
segment_heatmap = segment_benchmark.pivot(index="segment", columns="estimator", values="segment_bias")
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.heatmap(segment_heatmap, annot=True, fmt=".3f", cmap="vlag", center=0, linewidths=0.5, ax=ax)
ax.set_title("Segment-Level Average CATE Bias")
ax.set_xlabel("")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_segment_bias_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()
The heatmap separates global accuracy from segment reliability. A strong recommendation should mention both.
Cost-Aware Policy Targeting
The simplest policy treats a user when estimated CATE is greater than treatment cost. We compare that with two baselines: treat nobody and treat everybody.
def summarize_policy(policy_name, action, estimated_effect=None):
action = action.astype(int)
true_net = (true_tau_test - TREATMENT_COST) * action
return {
"policy": policy_name,
"treatment_rate": action.mean(),
"users_treated": int(action.sum()),
"true_net_value_per_user": true_net.mean(),
"true_gross_effect_per_user": (true_tau_test * action).mean(),
"precision_positive_net": np.mean(true_tau_test[action == 1] > TREATMENT_COST) if action.sum() > 0 else np.nan,
"mean_true_cate_among_treated": true_tau_test[action == 1].mean() if action.sum() > 0 else np.nan,
"mean_estimated_cate_among_treated": estimated_effect[action == 1].mean() if estimated_effect is not None and action.sum() > 0 else np.nan,
}
policy_rows = [
summarize_policy("Treat none", np.zeros(len(test_df), dtype=bool)),
summarize_policy("Treat all", np.ones(len(test_df), dtype=bool)),
]
for estimator_name in estimators:
estimated = effect_predictions[estimator_name].to_numpy()
policy_rows.append(summarize_policy(f"{estimator_name}: effect > cost", estimated > TREATMENT_COST, estimated))
policy_summary = pd.DataFrame(policy_rows).sort_values("true_net_value_per_user", ascending=False)
policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_policy_summary.csv", index=False)
display(policy_summary)
| | policy | treatment_rate | users_treated | true_net_value_per_user | true_gross_effect_per_user | precision_positive_net | mean_true_cate_among_treated | mean_estimated_cate_among_treated |
|---|---|---|---|---|---|---|---|---|
| 2 | LinearDML: effect > cost | 0.5394 | 890 | 0.1709 | 0.2679 | 0.8292 | 0.4968 | 0.6535 |
| 4 | DRLearner: effect > cost | 0.5188 | 856 | 0.1642 | 0.2575 | 0.8236 | 0.4964 | 0.5824 |
| 3 | CausalForestDML: effect > cost | 0.5558 | 917 | 0.1540 | 0.2540 | 0.7819 | 0.4571 | 0.3954 |
| 1 | Treat all | 1.0000 | 1650 | 0.0012 | 0.1812 | 0.4988 | 0.1812 | NaN |
| 0 | Treat none | 0.0000 | 0 | 0.0000 | 0.0000 | NaN | NaN | NaN |
The policy table is the handoff from estimation to action. A good CATE model should improve net value over simple baselines after treatment cost is included.
Policy Value Plot
The y-axis lists policies and the x-axis shows true net value per held-out user after subtracting treatment cost.
fig, ax = plt.subplots(figsize=(10, 5))
plot_policy = policy_summary.sort_values("true_net_value_per_user")
sns.barplot(data=plot_policy, x="true_net_value_per_user", y="policy", color="#22c55e", ax=ax)
ax.axvline(0, color="#111827", linewidth=1)
ax.set_title("Held-Out Net Policy Value")
ax.set_xlabel("True Net Value Per User")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_policy_value.png", dpi=160, bbox_inches="tight")
plt.show()
This is the most decision-facing plot in the notebook. It converts estimated effects into a concrete recommendation and compares that recommendation with obvious alternatives.
Budget-Constrained Targeting
A threshold rule is not the only practical choice. Under a fixed budget, the team may instead treat only the users with the highest estimated effects. This cell evaluates several budget levels.
budget_levels = [0.10, 0.20, 0.30, 0.40, 0.50]
budget_rows = []
for estimator_name in estimators:
estimated = effect_predictions[estimator_name].to_numpy()
for budget in budget_levels:
cutoff = np.quantile(estimated, 1 - budget)
action = estimated >= cutoff
budget_rows.append(
{
"estimator": estimator_name,
"budget_share": budget,
"users_treated": int(action.sum()),
"true_net_value_per_user": ((true_tau_test - TREATMENT_COST) * action).mean(),
"mean_true_cate_among_treated": true_tau_test[action].mean(),
"precision_positive_net": np.mean(true_tau_test[action] > TREATMENT_COST),
}
)
budget_curve = pd.DataFrame(budget_rows)
budget_curve.to_csv(TABLE_DIR / f"{CASE_PREFIX}_budget_curve.csv", index=False)
display(budget_curve.head(10))
| | estimator | budget_share | users_treated | true_net_value_per_user | mean_true_cate_among_treated | precision_positive_net |
|---|---|---|---|---|---|---|
| 0 | LinearDML | 0.1000 | 165 | 0.0786 | 0.9657 | 1.0000 |
| 1 | LinearDML | 0.2000 | 330 | 0.1222 | 0.7908 | 0.9939 |
| 2 | LinearDML | 0.3000 | 495 | 0.1507 | 0.6823 | 0.9636 |
| 3 | LinearDML | 0.4000 | 660 | 0.1671 | 0.5977 | 0.9182 |
| 4 | LinearDML | 0.5000 | 825 | 0.1715 | 0.5231 | 0.8570 |
| 5 | CausalForestDML | 0.1000 | 165 | 0.0749 | 0.9294 | 1.0000 |
| 6 | CausalForestDML | 0.2000 | 330 | 0.1163 | 0.7613 | 0.9727 |
| 7 | CausalForestDML | 0.3000 | 495 | 0.1396 | 0.6453 | 0.9131 |
| 8 | CausalForestDML | 0.4000 | 660 | 0.1530 | 0.5624 | 0.8682 |
| 9 | CausalForestDML | 0.5000 | 825 | 0.1565 | 0.4930 | 0.8133 |
Budget curves show whether an estimator remains useful as the program expands from a small high-confidence group to a broader population.
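A complementary way to read the budget table is through marginal value: how much total net value each additional budget step adds. This sketch reuses budget_curve and test_df from above; a small or negative increment suggests the newly added users contribute little beyond their cost.
# Marginal value of expanding the budget, computed from budget_curve above.
budget_increments = budget_curve.sort_values(["estimator", "budget_share"]).copy()
# Convert the per-user average into total net value over the held-out set.
budget_increments["total_net_value"] = budget_increments["true_net_value_per_user"] * len(test_df)
# Difference between consecutive budget levels within each estimator.
budget_increments["incremental_net_value"] = budget_increments.groupby("estimator")["total_net_value"].diff()
display(budget_increments[["estimator", "budget_share", "total_net_value", "incremental_net_value"]])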
Budget Curve Plot
The curve shows true net value per held-out user as the share of treated users grows. It typically rises steeply at low budgets, where the highest-estimated-effect users are treated first, and then flattens as lower-benefit users are added.
fig, ax = plt.subplots(figsize=(9, 5))
sns.lineplot(data=budget_curve, x="budget_share", y="true_net_value_per_user", hue="estimator", marker="o", ax=ax)
ax.axhline(0, color="#111827", linewidth=1)
ax.set_title("Budget-Constrained Targeting Value")
ax.set_xlabel("Share Of Users Treated")
ax.set_ylabel("True Net Value Per User")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_budget_curve.png", dpi=160, bbox_inches="tight")
plt.show()
This curve helps choose an operating point. It also avoids implying that a model supports broad rollout when its value is concentrated in a small top segment.
Support-Aware Policy Filter
A conservative policy can restrict model-based targeting to users whose estimated propensity from the overlap diagnostic lies between 0.10 and 0.90. This avoids acting in regions where observational support is thin.
support_mask = (overlap_frame["estimated_propensity"].to_numpy() >= 0.10) & (overlap_frame["estimated_propensity"].to_numpy() <= 0.90)
support_policy_rows = []
for estimator_name in estimators:
estimated = effect_predictions[estimator_name].to_numpy()
standard_action = estimated > TREATMENT_COST
support_action = standard_action & support_mask
support_policy_rows.append(
{
"estimator": estimator_name,
"eligible_share": support_mask.mean(),
"standard_treatment_rate": standard_action.mean(),
"support_aware_treatment_rate": support_action.mean(),
"standard_true_net_value": ((true_tau_test - TREATMENT_COST) * standard_action).mean(),
"support_aware_true_net_value": ((true_tau_test - TREATMENT_COST) * support_action).mean(),
"users_removed_by_support_filter": int((standard_action & ~support_mask).sum()),
}
)
support_policy_summary = pd.DataFrame(support_policy_rows).sort_values("support_aware_true_net_value", ascending=False)
support_policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_support_aware_policy_summary.csv", index=False)
display(support_policy_summary)
| | estimator | eligible_share | standard_treatment_rate | support_aware_treatment_rate | standard_true_net_value | support_aware_true_net_value | users_removed_by_support_filter |
|---|---|---|---|---|---|---|---|
| 0 | LinearDML | 0.9988 | 0.5394 | 0.5382 | 0.1709 | 0.1693 | 2 |
| 2 | DRLearner | 0.9988 | 0.5188 | 0.5176 | 0.1642 | 0.1626 | 2 |
| 1 | CausalForestDML | 0.9988 | 0.5558 | 0.5545 | 0.1540 | 0.1525 | 2 |
The support-aware version is more conservative. It may sacrifice some value in this simulation, but it is often a healthier recommendation when overlap is imperfect.
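The 0.10/0.90 bounds are a convention, not a rule. The sketch below repeats the support-aware comparison over a small grid of trimming bounds (the grid itself is an arbitrary choice) so the team can see how eligible share and net value move as the filter tightens. It reuses overlap_frame, effect_predictions, true_tau_test, and TREATMENT_COST from earlier cells.
# Sensitivity of the support filter to the trimming bound; grid values are illustrative.
propensity = overlap_frame["estimated_propensity"].to_numpy()
trim_rows = []
for bound in [0.02, 0.05, 0.10, 0.15]:
    mask = (propensity >= bound) & (propensity <= 1 - bound)
    for estimator_name in estimators:
        action = (effect_predictions[estimator_name].to_numpy() > TREATMENT_COST) & mask
        trim_rows.append(
            {
                "estimator": estimator_name,
                "trim_bound": bound,
                "eligible_share": mask.mean(),
                "support_aware_true_net_value": ((true_tau_test - TREATMENT_COST) * action).mean(),
            }
        )
display(pd.DataFrame(trim_rows).sort_values(["estimator", "trim_bound"]))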
Bootstrap Uncertainty For Policy Value
The next cell bootstraps the held-out rows and recomputes policy value. This gives a lightweight uncertainty summary for policy comparisons.
bootstrap_rng = np.random.default_rng(RANDOM_STATE + 99)
N_BOOTSTRAPS = 300
policy_actions = {
"Treat none": np.zeros(len(test_df), dtype=bool),
"Treat all": np.ones(len(test_df), dtype=bool),
}
for estimator_name in estimators:
policy_actions[f"{estimator_name}: effect > cost"] = effect_predictions[estimator_name].to_numpy() > TREATMENT_COST
bootstrap_rows = []
for draw in range(N_BOOTSTRAPS):
sample_index = bootstrap_rng.choice(len(test_df), size=len(test_df), replace=True)
for policy_name, action in policy_actions.items():
value = ((true_tau_test[sample_index] - TREATMENT_COST) * action[sample_index]).mean()
bootstrap_rows.append({"draw": draw, "policy": policy_name, "true_net_value_per_user": value})
bootstrap_policy = pd.DataFrame(bootstrap_rows)
bootstrap_policy_summary = (
bootstrap_policy.groupby("policy")
.agg(
mean_value=("true_net_value_per_user", "mean"),
lower_95=("true_net_value_per_user", lambda values: np.quantile(values, 0.025)),
upper_95=("true_net_value_per_user", lambda values: np.quantile(values, 0.975)),
)
.reset_index()
.sort_values("mean_value", ascending=False)
)
bootstrap_policy.to_csv(TABLE_DIR / f"{CASE_PREFIX}_bootstrap_policy_values.csv", index=False)
bootstrap_policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_bootstrap_policy_summary.csv", index=False)
display(bootstrap_policy_summary)
| | policy | mean_value | lower_95 | upper_95 |
|---|---|---|---|---|
| 2 | LinearDML: effect > cost | 0.1701 | 0.1560 | 0.1837 |
| 1 | DRLearner: effect > cost | 0.1635 | 0.1490 | 0.1773 |
| 0 | CausalForestDML: effect > cost | 0.1531 | 0.1386 | 0.1663 |
| 3 | Treat all | 0.0006 | -0.0204 | 0.0228 |
| 4 | Treat none | 0.0000 | 0.0000 | 0.0000 |
The intervals show whether the apparent policy winner is clearly separated or only slightly ahead. That matters for honest reporting.
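Because every policy is evaluated on the same bootstrap draws, a paired comparison is sharper than comparing two marginal intervals. This sketch takes the top two policies from bootstrap_policy_summary and summarizes the distribution of their per-draw difference; an interval that stays above zero would indicate clear separation.
# Paired bootstrap comparison of the top two policies, using bootstrap_policy from above.
draw_values = bootstrap_policy.pivot(index="draw", columns="policy", values="true_net_value_per_user")
top_two = bootstrap_policy_summary["policy"].head(2).tolist()
paired_diff = draw_values[top_two[0]] - draw_values[top_two[1]]
print(f"Paired difference: {top_two[0]!r} minus {top_two[1]!r}")
print(f"Mean difference: {paired_diff.mean():.4f}")
print(f"95% interval: [{np.quantile(paired_diff, 0.025):.4f}, {np.quantile(paired_diff, 0.975):.4f}]")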
Bootstrap Policy Plot
This plot summarizes uncertainty around net policy value. Wider intervals mean the estimated policy value is more sensitive to which users are in the evaluation sample.
plot_bootstrap = bootstrap_policy_summary.sort_values("mean_value")
fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
x=plot_bootstrap["mean_value"],
y=plot_bootstrap["policy"],
xerr=[
plot_bootstrap["mean_value"] - plot_bootstrap["lower_95"],
plot_bootstrap["upper_95"] - plot_bootstrap["mean_value"],
],
fmt="o",
color="#2563eb",
ecolor="#334155",
capsize=4,
)
ax.axvline(0, color="#111827", linewidth=1)
ax.set_title("Bootstrap Uncertainty For Policy Value")
ax.set_xlabel("True Net Value Per User")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_bootstrap_policy_intervals.png", dpi=160, bbox_inches="tight")
plt.show()
The final recommendation should consider both point value and uncertainty. If intervals overlap heavily, the safer conclusion is that policies are practically similar until stronger validation is available.
Report-Ready Summary Table
The final comparison table combines estimation accuracy, targeting value, and runtime. It is designed to be readable outside the notebook.
policy_rows_for_estimators = policy_summary[policy_summary["policy"].str.contains(": effect > cost", regex=False)].copy()
policy_rows_for_estimators["estimator"] = policy_rows_for_estimators["policy"].str.replace(": effect > cost", "", regex=False)
report_summary = (
benchmark_metrics[["estimator", "estimated_ate", "ate_bias", "cate_rmse", "cate_correlation", "top_20_true_cate_mean", "fit_seconds"]]
.merge(
policy_rows_for_estimators[["estimator", "treatment_rate", "true_net_value_per_user", "precision_positive_net"]],
on="estimator",
)
.sort_values("true_net_value_per_user", ascending=False)
)
report_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_report_summary.csv", index=False)
display(report_summary)
| | estimator | estimated_ate | ate_bias | cate_rmse | cate_correlation | top_20_true_cate_mean | fit_seconds | treatment_rate | true_net_value_per_user | precision_positive_net |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LinearDML | 0.2309 | 0.0497 | 0.2719 | 0.8948 | 0.7908 | 3.4164 | 0.5394 | 0.1709 | 0.8292 |
| 1 | DRLearner | 0.2099 | 0.0287 | 0.2720 | 0.8458 | 0.7515 | 3.9710 | 0.5188 | 0.1642 | 0.8236 |
| 2 | CausalForestDML | 0.2147 | 0.0336 | 0.3168 | 0.8173 | 0.7613 | 3.6490 | 0.5558 | 0.1540 | 0.7819 |
This table prevents the conclusion from relying on a single metric. A useful causal model must estimate effects, rank users, and support a decision that creates net value.
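One light way to guard against single-metric reasoning is a simple rank check across the report columns. The sketch below ranks each estimator on a few report_summary metrics and averages the ranks; the metric set and equal weighting are choices made here for illustration, not a standard.
# Cross-metric rank check on report_summary; metric selection and equal weights are illustrative.
rank_table = report_summary.set_index("estimator")
metric_ranks = pd.DataFrame(
    {
        "cate_rmse": rank_table["cate_rmse"].rank(),  # lower is better
        "ate_bias": rank_table["ate_bias"].abs().rank(),  # closer to zero is better
        "cate_correlation": rank_table["cate_correlation"].rank(ascending=False),  # higher is better
        "true_net_value_per_user": rank_table["true_net_value_per_user"].rank(ascending=False),  # higher is better
        "precision_positive_net": rank_table["precision_positive_net"].rank(ascending=False),  # higher is better
    }
)
metric_ranks["average_rank"] = metric_ranks.mean(axis=1)
display(metric_ranks.sort_values("average_rank"))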
Written Case Study Recommendation
The next cell writes a compact markdown summary from the notebook outputs. It names the decision, the strongest policy, the evidence used, and the caveats.
best_policy = policy_summary.iloc[0]  # policy_summary is sorted by true net value, so row 0 is the strongest policy
best_estimator_by_rmse = benchmark_metrics.iloc[0]  # assumes benchmark_metrics is sorted with the lowest held-out CATE RMSE first
report_lines = [
"# End-To-End EconML Case Study Summary",
"",
"## Decision",
"",
f"Estimate which users should receive a proactive onboarding bundle when treatment cost is `{TREATMENT_COST:.2f}` outcome units per treated user.",
"",
"## Main Result",
"",
f"The strongest held-out policy by true net value is **{best_policy['policy']}**.",
"",
f"- Treatment rate: `{best_policy['treatment_rate']:.1%}`",
f"- True net value per user: `{best_policy['true_net_value_per_user']:.4f}`",
f"- Precision among treated users: `{best_policy['precision_positive_net']:.1%}`",
"",
f"The estimator with the lowest held-out CATE RMSE is **{best_estimator_by_rmse['estimator']}** with RMSE `{best_estimator_by_rmse['cate_rmse']:.4f}`.",
"",
"## Evidence Used",
"",
"The recommendation uses covariate balance checks, overlap diagnostics, nuisance-model diagnostics, held-out CATE recovery, segment bias checks, policy value comparisons, budget curves, and bootstrap policy intervals.",
"",
"## Caveats",
"",
"This is a teaching dataset with known ground truth. In real observational work, hidden true CATE would not be available, so the workflow would need domain review of confounders, sensitivity analysis, temporal validation, and ideally an experiment for policy confirmation.",
]
report_text = "\n".join(report_lines)
report_path = REPORT_DIR / f"{CASE_PREFIX}_case_study_summary.md"
report_path.write_text(report_text)
display(Markdown(report_text))
print(f"Saved report to: {report_path}")End-To-End EconML Case Study Summary
Decision
Estimate which users should receive a proactive onboarding bundle when treatment cost is 0.18 outcome units per treated user.
Main Result
The strongest held-out policy by true net value is LinearDML: effect > cost.
- Treatment rate: 53.9%
- True net value per user: 0.1709
- Precision among treated users: 82.9%
The estimator with the lowest held-out CATE RMSE is LinearDML with RMSE 0.2719.
Evidence Used
The recommendation uses covariate balance checks, overlap diagnostics, nuisance-model diagnostics, held-out CATE recovery, segment bias checks, policy value comparisons, budget curves, and bootstrap policy intervals.
Caveats
This is a teaching dataset with known ground truth. In real observational work, hidden true CATE would not be available, so the workflow would need domain review of confounders, sensitivity analysis, temporal validation, and ideally an experiment for policy confirmation.
Saved report to: /home/apex/Documents/ranking_sys/notebooks/tutorials/econml/outputs/reports/14_case_study_summary.md
The written summary is short because the notebook already contains the evidence. The recommendation is useful precisely because it pairs a policy result with diagnostics and caveats.
Final Checklist
The final checklist turns the notebook into a reusable workflow. In real analyses, missing checklist items should be treated as reasons for caution.
case_study_checklist = pd.DataFrame(
[
("Decision defined", "Treatment, outcome, cost, and target population are explicit."),
("Pre-treatment variables separated", "Effect modifiers and controls are assigned before fitting."),
("Missingness handled", "Imputation is learned from training data only."),
("Confounding diagnosed", "Raw gaps, balance, and propensity overlap are reviewed."),
("Nuisance models checked", "Treatment and outcome models have held-out diagnostics."),
("Multiple estimators compared", "Results are not tied to one estimator by default."),
("Policy value evaluated", "CATE estimates are translated into net decision value."),
("Support considered", "Weak-overlap regions are handled cautiously."),
("Uncertainty shown", "Bootstrap intervals or comparable uncertainty summaries are included."),
("Caveats documented", "Assumptions and validation needs are written plainly."),
],
columns=["check", "what_good_reporting_should_include"],
)
case_study_checklist.to_csv(TABLE_DIR / f"{CASE_PREFIX}_case_study_checklist.csv", index=False)
display(case_study_checklist)
| | check | what_good_reporting_should_include |
|---|---|---|
| 0 | Decision defined | Treatment, outcome, cost, and target populatio... |
| 1 | Pre-treatment variables separated | Effect modifiers and controls are assigned bef... |
| 2 | Missingness handled | Imputation is learned from training data only. |
| 3 | Confounding diagnosed | Raw gaps, balance, and propensity overlap are ... |
| 4 | Nuisance models checked | Treatment and outcome models have held-out dia... |
| 5 | Multiple estimators compared | Results are not tied to one estimator by default. |
| 6 | Policy value evaluated | CATE estimates are translated into net decisio... |
| 7 | Support considered | Weak-overlap regions are handled cautiously. |
| 8 | Uncertainty shown | Bootstrap intervals or comparable uncertainty ... |
| 9 | Caveats documented | Assumptions and validation needs are written p... |
The checklist closes the loop. A strong EconML workflow is not only estimator fitting; it is design, diagnostics, estimation, decision translation, and careful reporting.
What Comes Next
The final notebook in this EconML tutorial sequence should cover common pitfalls, debugging, and reporting. This case study used a clean teaching setup; real applied analyses are more likely to fail through leakage, weak overlap, invalid controls, unstable nuisance models, or overconfident CATE storytelling.