EconML Tutorial 14: End-To-End Case Study

This notebook is the capstone workflow for the EconML tutorial sequence. Earlier notebooks focused on individual estimators, diagnostics, or decision steps. Here we connect those pieces into one applied causal analysis: define the decision, inspect the observational data, check assumptions, fit nuisance models, estimate heterogeneous effects, translate effects into a targeting policy, and prepare results that are easy to explain.

The setting is synthetic but realistic: a digital product team is deciding which users should receive a proactive onboarding bundle. Because the data are simulated, we know the true individual treatment effects. That hidden truth is used only for teaching and validation; it is never used to fit the EconML estimators.

Learning Goals

By the end of this notebook, you should be able to:

  • Convert an applied decision into treatment, outcome, controls, effect modifiers, and target estimand.
  • Run data-quality, missingness, balance, and overlap checks before fitting CATE models.
  • Prepare X and W matrices for EconML without leaking test information into training.
  • Fit LinearDML, CausalForestDML, and DRLearner on the same design.
  • Compare ATE bias, CATE recovery, segment behavior, targeting value, budget curves, and runtime.
  • Turn CATE estimates into a practical policy recommendation with clear limitations.

Notebook Flow

The case study follows a reusable analysis pattern:

  1. Define the decision and causal roles.
  2. Generate a realistic observational dataset with known hidden truth.
  3. Inspect field definitions, missingness, raw gaps, imbalance, and overlap.
  4. Build train/test matrices with clean imputation.
  5. Fit nuisance models and EconML treatment-effect estimators.
  6. Evaluate held-out CATE recovery and segment stability.
  7. Translate estimates into cost-aware targeting policies.
  8. Save report-ready tables, figures, and a short written recommendation.

Setup

The first code cell imports libraries, configures plotting, suppresses noisy implementation warnings, and creates output folders. MPLCONFIGDIR is set before importing Matplotlib so notebook execution does not create cache warnings on machines where the default Matplotlib config directory is not writable.

from pathlib import Path
import os
import time
import warnings

NOTEBOOK_DIR = Path.cwd()
if NOTEBOOK_DIR.name != "econml":
    NOTEBOOK_DIR = Path("notebooks/tutorials/econml").resolve()

OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
for directory in [FIGURE_DIR, TABLE_DIR, REPORT_DIR, OUTPUT_DIR / "matplotlib_cache"]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str((OUTPUT_DIR / "matplotlib_cache").resolve()))

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")
warnings.filterwarnings("ignore", message=".*column-vector y was passed.*")
warnings.filterwarnings("ignore", message=".*lbfgs failed to converge.*")
warnings.filterwarnings("ignore", message=".*invalid value encountered.*")
warnings.filterwarnings("ignore", message=".*Co-variance matrix is underdetermined.*")

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

from econml.dml import CausalForestDML, LinearDML
from econml.dr import DRLearner

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.dpi"] = 120
pd.set_option("display.max_columns", 80)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

The output directories are now ready. Every artifact created by this notebook uses a 14_ prefix so it can be separated from the earlier tutorial outputs.

Reproducibility Settings

This notebook uses random simulation, train-test splitting, tree-based nuisance models, and bootstrap resampling. A fixed seed makes the results stable across runs while still giving us a realistic observational dataset.

RANDOM_STATE = 20260430
rng = np.random.default_rng(RANDOM_STATE)
CASE_PREFIX = "14"
N_USERS = 5_500
TEST_SIZE = 0.30
TREATMENT_COST = 0.18

settings = pd.DataFrame(
    {
        "setting": ["random_state", "n_users", "test_size", "treatment_cost"],
        "value": [RANDOM_STATE, N_USERS, TEST_SIZE, TREATMENT_COST],
        "why_it_matters": [
            "Makes simulation, model fitting, and bootstrap resampling reproducible.",
            "Keeps the notebook fast while leaving enough data for heterogeneous-effect diagnostics.",
            "Creates a held-out sample for estimator comparison and policy evaluation.",
            "Converts gross CATE into net decision value for targeting.",
        ],
    }
)
settings.to_csv(TABLE_DIR / f"{CASE_PREFIX}_analysis_settings.csv", index=False)
display(settings)
setting value why_it_matters
0 random_state 20,260,430.0000 Makes simulation, model fitting, and bootstrap...
1 n_users 5,500.0000 Keeps the notebook fast while leaving enough d...
2 test_size 0.3000 Creates a held-out sample for estimator compar...
3 treatment_cost 0.1800 Converts gross CATE into net decision value fo...

The treatment cost is measured in the same units as the outcome. A user should be targeted only if the expected incremental value is larger than this cost. That distinction turns effect estimation into a decision problem.
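
As a minimal sketch of that decision rule (using hypothetical effect values, since no model has been fit yet), treatment is worthwhile only when the net value is positive:

# Hypothetical estimated effects for four users; the real estimates come later.
example_cate = np.array([0.05, 0.18, 0.40, -0.10])
net_value = example_cate - TREATMENT_COST
treat = net_value > 0
display(pd.DataFrame({"estimated_cate": example_cate, "net_value": net_value, "treat": treat}))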

Product Decision And Causal Roles

The hypothetical decision is whether a user should receive a proactive onboarding bundle during their first week. The bundle includes extra guidance, recommended paths, and reminders. The outcome is engagement value over the next eight weeks.

The data are observational. Higher-need and higher-potential users are more likely to receive the bundle, so raw treated-control differences will mix treatment impact with selection bias.

role_map = pd.DataFrame(
    [
        ("Treatment T", "onboarding_bundle", "Whether the user received the proactive onboarding bundle.", "T"),
        ("Outcome Y", "eight_week_value", "Continuous engagement value after the treatment window.", "Y"),
        ("Effect modifiers X", "need, friction, preference, and reliability fields", "Variables that may change how much the bundle helps.", "X"),
        ("Controls W", "tenure, market, source, and operational fields", "Observed confounders adjusted for in nuisance models.", "W"),
        ("Hidden teaching truth", "true_cate and true_propensity", "Known only because this is a simulation; never used for fitting.", "Evaluation only"),
    ],
    columns=["role", "field", "meaning", "used_by_econml_as"],
)
role_map.to_csv(TABLE_DIR / f"{CASE_PREFIX}_causal_role_map.csv", index=False)
display(role_map)
role field meaning used_by_econml_as
0 Treatment T onboarding_bundle Whether the user received the proactive onboar... T
1 Outcome Y eight_week_value Continuous engagement value after the treatmen... Y
2 Effect modifiers X need, friction, preference, and reliability fi... Variables that may change how much the bundle ... X
3 Controls W tenure, market, source, and operational fields Observed confounders adjusted for in nuisance ... W
4 Hidden teaching truth true_cate and true_propensity Known only because this is a simulation; never... Evaluation only

This role table is the first guardrail. EconML is an estimation library; it still depends on the analyst to define treatment, outcome, confounders, and effect modifiers before fitting anything.

Data-Generating Process

The helper functions below create three hidden pieces of the synthetic world:

  • A true CATE function that determines how much each user benefits from onboarding.
  • A targeted treatment assignment process that creates confounding.
  • A baseline outcome function that determines engagement even without treatment.

The true CATE is nonlinear on purpose, so the case study can test whether flexible estimators are useful.

def make_correlated_normals(n_rows, n_features, correlation, rng):
    covariance = np.full((n_features, n_features), correlation)
    np.fill_diagonal(covariance, 1.0)
    return rng.multivariate_normal(np.zeros(n_features), covariance, size=n_rows)


def true_cate_function(frame):
    return (
        0.26
        + 0.28 * frame["onboarding_gap"]
        + 0.20 * frame["content_diversity"]
        - 0.22 * frame["support_contacts"]
        - 0.16 * frame["price_sensitivity"]
        + 0.18 * frame["high_motivation_segment"]
        + 0.15 * np.maximum(frame["search_difficulty"], 0)
        + 0.10 * frame["device_reliability"] * frame["content_diversity"]
        - 0.12 * frame["region_competition"]
    )


def treatment_propensity_function(frame):
    logit = (
        -0.20
        + 0.55 * frame["onboarding_gap"]
        + 0.42 * frame["prior_sessions"]
        + 0.35 * frame["high_motivation_segment"]
        + 0.30 * frame["content_diversity"]
        + 0.24 * frame["marketing_pressure"]
        - 0.28 * frame["device_reliability"]
        + 0.18 * frame["support_contacts"]
        - 0.16 * frame["referral_organic"]
    )
    return np.clip(expit(logit), 0.05, 0.95)


def baseline_outcome_function(frame):
    return (
        1.40
        + 0.45 * frame["prior_sessions"]
        + 0.35 * frame["content_diversity"]
        - 0.32 * frame["support_contacts"]
        - 0.28 * frame["search_difficulty"]
        - 0.22 * frame["price_sensitivity"]
        + 0.24 * frame["device_reliability"]
        + 0.18 * frame["market_maturity"]
        + 0.12 * frame["referral_organic"]
        + 0.08 * frame["weekday_signup"]
        + 0.10 * frame["account_complexity"]
        + 0.05 * np.sin(frame["tenure_weeks"] / 3.0)
    )

The CATE function encodes an intuitive story: onboarding helps more for users with clear onboarding gaps, diverse interests, and search friction, and less when support burden and price sensitivity are high. That gives us heterogeneous effects worth modeling and explaining.

Generate The User-Level Table

The next cell creates the observed dataset and inserts modest missingness into a few pre-treatment fields. Missingness makes the preparation step more realistic without turning the tutorial into a data-cleaning exercise.

latent = make_correlated_normals(N_USERS, 6, correlation=0.28, rng=rng)
users = pd.DataFrame(
    {
        "user_id": np.arange(1, N_USERS + 1),
        "onboarding_gap": latent[:, 0],
        "prior_sessions": latent[:, 1],
        "content_diversity": latent[:, 2],
        "search_difficulty": latent[:, 3],
        "price_sensitivity": latent[:, 4],
        "device_reliability": latent[:, 5],
    }
)

users["support_contacts"] = np.clip(rng.poisson(np.exp(0.10 + 0.25 * users["search_difficulty"]), N_USERS), 0, 6)
users["tenure_weeks"] = np.clip(rng.gamma(shape=3.0, scale=3.0, size=N_USERS), 1, 40)
users["region_competition"] = rng.normal(0, 1, N_USERS)
users["market_maturity"] = rng.normal(0, 1, N_USERS)
users["marketing_pressure"] = rng.normal(0, 1, N_USERS)
users["account_complexity"] = rng.normal(0, 1, N_USERS)
users["referral_organic"] = rng.binomial(1, expit(0.15 + 0.25 * users["content_diversity"]), N_USERS)
users["weekday_signup"] = rng.binomial(1, 0.72, N_USERS)
users["mobile_primary"] = rng.binomial(1, expit(0.05 - 0.20 * users["device_reliability"]), N_USERS)
users["high_motivation_segment"] = rng.binomial(
    1,
    expit(0.35 * users["prior_sessions"] + 0.25 * users["content_diversity"] - 0.10 * users["price_sensitivity"]),
    N_USERS,
)

users["true_cate"] = true_cate_function(users)
users["true_propensity"] = treatment_propensity_function(users)
users["onboarding_bundle"] = rng.binomial(1, users["true_propensity"], N_USERS)
users["baseline_value"] = baseline_outcome_function(users)
users["eight_week_value"] = users["baseline_value"] + users["onboarding_bundle"] * users["true_cate"] + rng.normal(0, 0.75, N_USERS)

for column, missing_rate in {
    "content_diversity": 0.035,
    "search_difficulty": 0.025,
    "support_contacts": 0.020,
    "device_reliability": 0.030,
}.items():
    users.loc[rng.random(N_USERS) < missing_rate, column] = np.nan

users.to_csv(TABLE_DIR / f"{CASE_PREFIX}_case_study_teaching_data.csv", index=False)
display(users.head())
user_id onboarding_gap prior_sessions content_diversity search_difficulty price_sensitivity device_reliability support_contacts tenure_weeks region_competition market_maturity marketing_pressure account_complexity referral_organic weekday_signup mobile_primary high_motivation_segment true_cate true_propensity onboarding_bundle baseline_value eight_week_value
0 1 0.7118 0.0075 1.2556 0.0693 -0.3601 0.6739 0.0000 10.3103 -1.2263 -1.1327 0.4568 -0.3511 0 0 0 1 1.1902 0.6989 1 1.8108 2.7246
1 2 -0.0587 -0.3216 -0.1668 1.6010 -0.8754 -0.8493 2.0000 13.5698 0.6656 -0.1502 -0.3617 -0.2442 0 0 0 1 0.2647 0.6091 0 -0.0032 0.6347
2 3 1.8492 0.5357 0.2346 1.0726 0.4464 0.9111 1.0000 4.8698 0.4456 0.0216 0.8945 -0.5806 0 0 1 0 0.6621 0.7776 1 1.2190 1.8847
3 4 0.0469 0.0257 0.6607 2.2414 -0.5130 -0.9861 1.0000 4.3458 -1.5559 -0.7411 -0.9080 -0.7013 1 1 1 1 0.9051 0.6137 0 0.6175 0.0734
4 5 0.2438 1.3964 -1.3551 0.1775 -0.9877 -1.0327 1.0000 17.7193 0.8292 0.0023 2.4848 -0.0242 0 1 0 0 0.0623 0.7649 1 1.2134 0.9457

The dataset now contains the observed fields an analyst would use and the hidden truth fields used for teaching checks. In a real analysis, only observed pre-treatment fields, treatment, and outcome would be available.

Field Dictionary

A field dictionary makes the data roles auditable. It also helps avoid using post-treatment or hidden teaching fields accidentally in the model matrix.

field_dictionary = pd.DataFrame(
    [
        ("user_id", "Identifier", "Row identifier used only for tracking.", "Observed", "Index only"),
        ("onboarding_gap", "Numeric", "How much help the user appears to need during onboarding.", "Observed", "X"),
        ("prior_sessions", "Numeric", "Early usage intensity before the bundle decision.", "Observed", "X"),
        ("content_diversity", "Numeric", "Breadth of content interests visible before treatment.", "Observed", "X"),
        ("search_difficulty", "Numeric", "Signal that the user struggles to find useful content.", "Observed", "X"),
        ("support_contacts", "Numeric", "Number of early support interactions before treatment.", "Observed", "X"),
        ("price_sensitivity", "Numeric", "Proxy for whether cost concerns limit engagement.", "Observed", "X"),
        ("device_reliability", "Numeric", "Proxy for whether technical quality enables benefit.", "Observed", "X"),
        ("high_motivation_segment", "Binary", "Pre-treatment segment flag from early intent signals.", "Observed", "X"),
        ("tenure_weeks", "Numeric", "Account age at decision time.", "Observed", "W"),
        ("region_competition", "Numeric", "Local competitive intensity proxy.", "Observed", "W"),
        ("market_maturity", "Numeric", "Market environment quality proxy.", "Observed", "W"),
        ("marketing_pressure", "Numeric", "Recent non-bundle marketing pressure.", "Observed", "W"),
        ("account_complexity", "Numeric", "Operational complexity proxy.", "Observed", "W"),
        ("referral_organic", "Binary", "Whether acquisition source was organic.", "Observed", "W"),
        ("weekday_signup", "Binary", "Whether signup happened on a weekday.", "Observed", "W"),
        ("mobile_primary", "Binary", "Whether early activity was mostly mobile.", "Observed", "W"),
        ("onboarding_bundle", "Binary", "Treatment assignment.", "Observed", "T"),
        ("eight_week_value", "Numeric", "Post-treatment outcome.", "Observed", "Y"),
        ("true_cate", "Numeric", "Known individual treatment effect for teaching evaluation.", "Hidden in real work", "Evaluation only"),
        ("true_propensity", "Numeric", "Known treatment probability for teaching evaluation.", "Hidden in real work", "Evaluation only"),
        ("baseline_value", "Numeric", "Known untreated outcome component.", "Hidden in real work", "Evaluation only"),
    ],
    columns=["field", "type", "description", "availability", "analysis_role"],
)
field_dictionary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_field_dictionary.csv", index=False)
display(field_dictionary)
field type description availability analysis_role
0 user_id Identifier Row identifier used only for tracking. Observed Index only
1 onboarding_gap Numeric How much help the user appears to need during ... Observed X
2 prior_sessions Numeric Early usage intensity before the bundle decision. Observed X
3 content_diversity Numeric Breadth of content interests visible before tr... Observed X
4 search_difficulty Numeric Signal that the user struggles to find useful ... Observed X
5 support_contacts Numeric Number of early support interactions before tr... Observed X
6 price_sensitivity Numeric Proxy for whether cost concerns limit engagement. Observed X
7 device_reliability Numeric Proxy for whether technical quality enables be... Observed X
8 high_motivation_segment Binary Pre-treatment segment flag from early intent s... Observed X
9 tenure_weeks Numeric Account age at decision time. Observed W
10 region_competition Numeric Local competitive intensity proxy. Observed W
11 market_maturity Numeric Market environment quality proxy. Observed W
12 marketing_pressure Numeric Recent non-bundle marketing pressure. Observed W
13 account_complexity Numeric Operational complexity proxy. Observed W
14 referral_organic Binary Whether acquisition source was organic. Observed W
15 weekday_signup Binary Whether signup happened on a weekday. Observed W
16 mobile_primary Binary Whether early activity was mostly mobile. Observed W
17 onboarding_bundle Binary Treatment assignment. Observed T
18 eight_week_value Numeric Post-treatment outcome. Observed Y
19 true_cate Numeric Known individual treatment effect for teaching... Hidden in real work Evaluation only
20 true_propensity Numeric Known treatment probability for teaching evalu... Hidden in real work Evaluation only
21 baseline_value Numeric Known untreated outcome component. Hidden in real work Evaluation only

The hidden teaching fields are clearly labeled as evaluation-only. This habit matters because accidental leakage can make causal models look much better than they are.

Basic Shape And Missingness

The first data check confirms the number of rows, treatment/control counts, and missing-value rates. Treatment and outcome must be complete for the simple workflow used here.

basic_shape = pd.DataFrame(
    {
        "metric": ["rows", "columns", "treated_rows", "control_rows", "outcome_missing", "treatment_missing"],
        "value": [
            len(users),
            users.shape[1],
            int(users["onboarding_bundle"].sum()),
            int((1 - users["onboarding_bundle"]).sum()),
            int(users["eight_week_value"].isna().sum()),
            int(users["onboarding_bundle"].isna().sum()),
        ],
    }
)
missing_summary = (
    users.isna().mean().rename("missing_rate").reset_index().rename(columns={"index": "field"})
    .query("missing_rate > 0")
    .sort_values("missing_rate", ascending=False)
)
basic_shape.to_csv(TABLE_DIR / f"{CASE_PREFIX}_basic_shape.csv", index=False)
missing_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_missing_summary.csv", index=False)
display(basic_shape)
display(missing_summary)
metric value
0 rows 5500
1 columns 22
2 treated_rows 2890
3 control_rows 2610
4 outcome_missing 0
5 treatment_missing 0
field missing_rate
3 content_diversity 0.0322
6 device_reliability 0.0278
4 search_difficulty 0.0240
7 support_contacts 0.0215

Treatment and outcome are complete. A few pre-treatment fields have light missingness, so the modeling section will impute them using train-only medians and add missingness flags.

Raw Treated-Control Difference

A raw outcome comparison is a useful descriptive baseline, but it is not a causal estimate when treatment assignment is targeted. This cell compares the observed raw gap to the true average effect known from the simulation.

raw_group_summary = (
    users.groupby("onboarding_bundle")
    .agg(
        users=("user_id", "size"),
        observed_mean_outcome=("eight_week_value", "mean"),
        true_mean_cate=("true_cate", "mean"),
        mean_true_propensity=("true_propensity", "mean"),
    )
    .reset_index()
)
treated_mean = raw_group_summary.loc[raw_group_summary["onboarding_bundle"] == 1, "observed_mean_outcome"].iloc[0]
control_mean = raw_group_summary.loc[raw_group_summary["onboarding_bundle"] == 0, "observed_mean_outcome"].iloc[0]
raw_difference = treated_mean - control_mean
true_population_ate = users["true_cate"].mean()
raw_vs_truth = pd.DataFrame(
    {
        "quantity": ["raw_treated_minus_control", "true_population_ate", "raw_minus_true_ate"],
        "value": [raw_difference, true_population_ate, raw_difference - true_population_ate],
    }
)
raw_group_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_raw_group_summary.csv", index=False)
raw_vs_truth.to_csv(TABLE_DIR / f"{CASE_PREFIX}_raw_difference_vs_truth.csv", index=False)
display(raw_group_summary)
display(raw_vs_truth)
onboarding_bundle users observed_mean_outcome true_mean_cate mean_true_propensity
0 0 2610 1.0686 0.0881 0.4417
1 1 2890 1.5806 0.2930 0.5946
quantity value
0 raw_treated_minus_control 0.5120
1 true_population_ate 0.1958
2 raw_minus_true_ate 0.3163

The raw gap (about 0.51) is more than double the true average effect (about 0.20). The treated group also has a noticeably higher mean assignment propensity, which is direct evidence of selection into treatment.

True Effect Distribution For Teaching

Because the data are simulated, we can inspect the true treatment-effect distribution. This is not available in real work, but it helps explain why heterogeneous-effect modeling is useful.

true_effect_summary = users["true_cate"].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).to_frame("true_cate")
true_effect_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_true_effect_summary.csv")

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(users["true_cate"], bins=45, kde=True, color="#2563eb", ax=ax)
ax.axvline(TREATMENT_COST, color="#dc2626", linestyle="--", linewidth=1.6, label=f"Treatment cost = {TREATMENT_COST:.2f}")
ax.axvline(users["true_cate"].mean(), color="#111827", linewidth=1.3, label="True mean CATE")
ax.set_title("True Heterogeneous Treatment Effects")
ax.set_xlabel("True CATE")
ax.set_ylabel("Users")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_true_cate_distribution.png", dpi=160, bbox_inches="tight")
plt.show()

display(true_effect_summary)

true_cate
count 5,500.0000
mean 0.1958
std 0.4820
min -1.5732
5% -0.5931
25% -0.1223
50% 0.1933
75% 0.5051
95% 1.0103
max 2.0447

The distribution has meaningful spread around the average. Some users are much more attractive to target than others once cost is considered.
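
Because this is a simulation, we can quantify that claim directly. This teaching-only check uses the hidden truth, which would be unavailable in real work:

# Teaching-only: share of users whose true effect clears the treatment cost.
share_above_cost = (users["true_cate"] > TREATMENT_COST).mean()
print(f"Share of users with true CATE above cost: {share_above_cost:.1%}")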

Covariate Balance Before Adjustment

Standardized mean differences compare treated and control groups on pre-treatment variables. Large absolute values indicate that treatment assignment is related to user characteristics.

OBSERVED_FEATURES = [
    "onboarding_gap", "prior_sessions", "content_diversity", "search_difficulty", "support_contacts",
    "price_sensitivity", "device_reliability", "high_motivation_segment", "tenure_weeks",
    "region_competition", "market_maturity", "marketing_pressure", "account_complexity",
    "referral_organic", "weekday_signup", "mobile_primary",
]

balance_rows = []
for feature in OBSERVED_FEATURES:
    treated = users.loc[users["onboarding_bundle"] == 1, feature]
    control = users.loc[users["onboarding_bundle"] == 0, feature]
    pooled_sd = np.sqrt((treated.var(skipna=True) + control.var(skipna=True)) / 2)
    smd = (treated.mean(skipna=True) - control.mean(skipna=True)) / pooled_sd if pooled_sd > 0 else 0.0
    balance_rows.append(
        {
            "feature": feature,
            "treated_mean": treated.mean(skipna=True),
            "control_mean": control.mean(skipna=True),
            "standardized_mean_difference": smd,
            "abs_smd": abs(smd),
        }
    )
balance_table = pd.DataFrame(balance_rows).sort_values("abs_smd", ascending=False)
balance_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_covariate_balance.csv", index=False)
display(balance_table.head(12))
feature treated_mean control_mean standardized_mean_difference abs_smd
0 onboarding_gap 0.2406 -0.2843 0.5503 0.5503
1 prior_sessions 0.2548 -0.2455 0.5121 0.5121
2 content_diversity 0.2235 -0.2174 0.4537 0.4537
3 search_difficulty 0.1248 -0.1917 0.3176 0.3176
7 high_motivation_segment 0.5651 0.4406 0.2508 0.2508
5 price_sensitivity 0.1005 -0.1055 0.2071 0.2071
11 marketing_pressure 0.1125 -0.0906 0.2062 0.2062
4 support_contacts 1.2205 1.0035 0.2036 0.2036
6 device_reliability 0.0726 -0.0835 0.1549 0.1549
14 weekday_signup 0.7087 0.7356 -0.0603 0.0603
12 account_complexity -0.0402 0.0069 -0.0469 0.0469
9 region_competition -0.0289 0.0049 -0.0343 0.0343

The largest imbalances show why the raw comparison is risky: several variables that influence treatment assignment also influence outcomes.

Balance Plot

The balance plot makes imbalance easier to scan. Dashed reference lines at +0.10 and -0.10 mark a common practical threshold for variables worth investigating.

fig, ax = plt.subplots(figsize=(9, 6))
plot_balance = balance_table.sort_values("standardized_mean_difference")
sns.barplot(data=plot_balance, x="standardized_mean_difference", y="feature", color="#60a5fa", ax=ax)
ax.axvline(0, color="#111827", linewidth=1)
ax.axvline(0.10, color="#dc2626", linestyle="--", linewidth=1)
ax.axvline(-0.10, color="#dc2626", linestyle="--", linewidth=1)
ax.set_title("Pre-Treatment Covariate Imbalance")
ax.set_xlabel("Standardized Mean Difference")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_covariate_balance.png", dpi=160, bbox_inches="tight")
plt.show()

The imbalance pattern is exactly what motivates causal adjustment. The assignment rule has targeted users with different pre-treatment profiles.

Propensity Overlap Diagnostic

Overlap means similar users can be observed both treated and untreated. If estimated propensity scores are extremely close to zero or one, CATE estimates become more dependent on extrapolation.

propensity_frame = users[OBSERVED_FEATURES].fillna(users[OBSERVED_FEATURES].median())
diagnostic_propensity_model = LogisticRegression(max_iter=1_000, solver="lbfgs")
diagnostic_propensity_model.fit(propensity_frame, users["onboarding_bundle"])
users["diagnostic_propensity"] = diagnostic_propensity_model.predict_proba(propensity_frame)[:, 1]

propensity_summary = users["diagnostic_propensity"].describe(percentiles=[0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99]).to_frame("diagnostic_propensity")
propensity_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_diagnostic_propensity_summary.csv")

fig, ax = plt.subplots(figsize=(9, 5))
sns.kdeplot(data=users, x="diagnostic_propensity", hue="onboarding_bundle", common_norm=False, fill=True, alpha=0.25, ax=ax)
ax.axvline(0.05, color="#dc2626", linestyle="--", linewidth=1)
ax.axvline(0.95, color="#dc2626", linestyle="--", linewidth=1)
ax.set_title("Estimated Propensity Overlap")
ax.set_xlabel("Diagnostic Propensity Score")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_diagnostic_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

display(propensity_summary)

diagnostic_propensity
count 5,500.0000
mean 0.5255
std 0.1915
min 0.0563
1% 0.1259
5% 0.2071
10% 0.2619
50% 0.5312
90% 0.7811
95% 0.8259
99% 0.9010
max 0.9715

Overlap is usable but not perfect. That means the analysis can proceed, while later recommendations should still be cautious around extreme-propensity users.
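
A quick tally makes that caution concrete. The sketch below counts users outside the same 0.05 and 0.95 reference lines drawn in the plot:

# Users whose diagnostic propensity falls outside the reference band.
extreme_mask = (users["diagnostic_propensity"] < 0.05) | (users["diagnostic_propensity"] > 0.95)
print(f"Users outside [0.05, 0.95]: {int(extreme_mask.sum())} ({extreme_mask.mean():.1%})")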

Train-Test Split

The estimators are fit on the training set and evaluated on a held-out set. Stratifying by treatment keeps the treated/control share similar in both samples.

train_df, test_df = train_test_split(
    users,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=users["onboarding_bundle"],
)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

split_summary = pd.DataFrame(
    {
        "split": ["train", "test"],
        "rows": [len(train_df), len(test_df)],
        "treatment_rate": [train_df["onboarding_bundle"].mean(), test_df["onboarding_bundle"].mean()],
        "mean_outcome": [train_df["eight_week_value"].mean(), test_df["eight_week_value"].mean()],
        "mean_true_cate": [train_df["true_cate"].mean(), test_df["true_cate"].mean()],
    }
)
split_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_train_test_split_summary.csv", index=False)
display(split_summary)
split rows treatment_rate mean_outcome mean_true_cate
0 train 3850 0.5255 1.3454 0.2020
1 test 1650 0.5255 1.3194 0.1812

The splits are comparable in size, treatment rate, and average true effect. That creates a stable evaluation setup for the rest of the notebook.

Feature Roles For EconML

X contains effect modifiers: variables we want the CATE function to vary with and variables a targeting rule may use. W contains additional controls used for adjustment in nuisance models.

X_COLUMNS = [
    "onboarding_gap", "prior_sessions", "content_diversity", "search_difficulty",
    "support_contacts", "price_sensitivity", "device_reliability", "high_motivation_segment",
]
W_COLUMNS = [
    "tenure_weeks", "region_competition", "market_maturity", "marketing_pressure",
    "account_complexity", "referral_organic", "weekday_signup", "mobile_primary",
]

feature_role_table = pd.DataFrame(
    {
        "field": X_COLUMNS + W_COLUMNS,
        "econml_role": ["X"] * len(X_COLUMNS) + ["W"] * len(W_COLUMNS),
    }
)
feature_role_table["reason"] = [
    "May directly change how much help onboarding provides.",
    "Early activity can modify treatment value.",
    "Breadth of interests can make guided discovery more useful.",
    "Search difficulty is a clear need signal.",
    "Support burden may limit incremental benefit.",
    "Cost concerns can reduce realized impact.",
    "Technical reliability can enable or block benefit.",
    "Motivation segment is an actionable effect modifier.",
    "Account age adjusts for maturity at decision time.",
    "Market competition affects both assignment and outcome.",
    "Market maturity adjusts for environment quality.",
    "Other marketing pressure can confound outcomes.",
    "Operational complexity adjusts for baseline value.",
    "Acquisition source can confound early engagement.",
    "Signup day controls timing patterns.",
    "Primary device controls experience differences.",
]
feature_role_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_feature_role_table.csv", index=False)
display(feature_role_table)
field econml_role reason
0 onboarding_gap X May directly change how much help onboarding p...
1 prior_sessions X Early activity can modify treatment value.
2 content_diversity X Breadth of interests can make guided discovery...
3 search_difficulty X Search difficulty is a clear need signal.
4 support_contacts X Support burden may limit incremental benefit.
5 price_sensitivity X Cost concerns can reduce realized impact.
6 device_reliability X Technical reliability can enable or block bene...
7 high_motivation_segment X Motivation segment is an actionable effect mod...
8 tenure_weeks W Account age adjusts for maturity at decision t...
9 region_competition W Market competition affects both assignment and...
10 market_maturity W Market maturity adjusts for environment quality.
11 marketing_pressure W Other marketing pressure can confound outcomes.
12 account_complexity W Operational complexity adjusts for baseline va...
13 referral_organic W Acquisition source can confound early engagement.
14 weekday_signup W Signup day controls timing patterns.
15 mobile_primary W Primary device controls experience differences.

This split keeps the CATE surface focused on user-need and product-experience fields. The controls still adjust for confounding, but they are not the main segmentation story.
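
A small guardrail cell can enforce that separation mechanically. The sketch below uses the field dictionary built earlier to confirm that no treatment, outcome, or evaluation-only field appears in the model matrices:

# Fail loudly if a banned field leaks into X or W.
banned_fields = set(
    field_dictionary.loc[
        field_dictionary["analysis_role"].isin(["Evaluation only", "T", "Y"]), "field"
    ]
)
leaked = banned_fields.intersection(X_COLUMNS + W_COLUMNS)
assert not leaked, f"Leaked fields in model matrices: {sorted(leaked)}"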

Train-Only Imputation And Matrix Construction

Missing values are imputed using training medians only. Missingness flags are placed in W, allowing nuisance models to learn whether missingness itself is predictive without making the CATE surface harder to explain.

ALL_MODEL_COLUMNS = X_COLUMNS + W_COLUMNS
train_medians = train_df[ALL_MODEL_COLUMNS].median(numeric_only=True)
# Flag any column with missing values in either split so train and test share
# one schema; the imputation values themselves come from the training set only.
MISSING_FLAG_COLUMNS = [column for column in ALL_MODEL_COLUMNS if train_df[column].isna().any() or test_df[column].isna().any()]


def build_design_matrices(frame, medians):
    raw_features = frame[ALL_MODEL_COLUMNS].copy()
    missing_flags = raw_features[MISSING_FLAG_COLUMNS].isna().astype(float).add_suffix("_missing")
    filled_features = raw_features.fillna(medians)
    X = filled_features[X_COLUMNS].astype(float)
    W = pd.concat([filled_features[W_COLUMNS].astype(float), missing_flags], axis=1)
    return X, W

X_train, W_train = build_design_matrices(train_df, train_medians)
X_test, W_test = build_design_matrices(test_df, train_medians)
T_train = train_df["onboarding_bundle"].astype(int)
T_test = test_df["onboarding_bundle"].astype(int)
Y_train = train_df["eight_week_value"].astype(float)
Y_test = test_df["eight_week_value"].astype(float)
true_tau_test = test_df["true_cate"].to_numpy()

matrix_summary = pd.DataFrame(
    {
        "matrix": ["X_train", "W_train", "X_test", "W_test"],
        "rows": [X_train.shape[0], W_train.shape[0], X_test.shape[0], W_test.shape[0]],
        "columns": [X_train.shape[1], W_train.shape[1], X_test.shape[1], W_test.shape[1]],
        "missing_values_after_preparation": [
            int(X_train.isna().sum().sum()), int(W_train.isna().sum().sum()),
            int(X_test.isna().sum().sum()), int(W_test.isna().sum().sum()),
        ],
    }
)
matrix_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_model_matrix_summary.csv", index=False)
display(matrix_summary)
matrix rows columns missing_values_after_preparation
0 X_train 3850 8 0
1 W_train 3850 12 0
2 X_test 1650 8 0
3 W_test 1650 12 0

The prepared matrices contain no missing values. The key practice is that test rows did not influence the imputation values used for training.

Nuisance Model Diagnostics

Treatment-effect estimators rely on nuisance models for treatment assignment and outcomes. These diagnostics check whether observed pre-treatment features contain enough predictive signal to support adjustment.

def make_rf_classifier(seed_offset=0):
    return RandomForestClassifier(
        n_estimators=180,
        min_samples_leaf=30,
        max_features=0.75,
        random_state=RANDOM_STATE + seed_offset,
        n_jobs=-1,
    )


def make_rf_regressor(seed_offset=0):
    return RandomForestRegressor(
        n_estimators=180,
        min_samples_leaf=25,
        max_features=0.75,
        random_state=RANDOM_STATE + seed_offset,
        n_jobs=-1,
    )

XW_train = pd.concat([X_train, W_train], axis=1)
XW_test = pd.concat([X_test, W_test], axis=1)

propensity_model = make_rf_classifier(10)
propensity_model.fit(XW_train, T_train)
propensity_pred = np.clip(propensity_model.predict_proba(XW_test)[:, 1], 1e-4, 1 - 1e-4)

outcome_model_control = make_rf_regressor(20)
outcome_model_treated = make_rf_regressor(30)
outcome_model_control.fit(XW_train.loc[T_train == 0], Y_train.loc[T_train == 0])
outcome_model_treated.fit(XW_train.loc[T_train == 1], Y_train.loc[T_train == 1])
mu0_pred = outcome_model_control.predict(XW_test)
mu1_pred = outcome_model_treated.predict(XW_test)
outcome_pred_observed_arm = np.where(T_test.to_numpy() == 1, mu1_pred, mu0_pred)

nuisance_diagnostics = pd.DataFrame(
    {
        "diagnostic": [
            "propensity_auc", "propensity_log_loss", "propensity_min", "propensity_max",
            "outcome_observed_arm_r2", "outcome_observed_arm_mae",
        ],
        "value": [
            roc_auc_score(T_test, propensity_pred), log_loss(T_test, propensity_pred),
            propensity_pred.min(), propensity_pred.max(),
            r2_score(Y_test, outcome_pred_observed_arm), mean_absolute_error(Y_test, outcome_pred_observed_arm),
        ],
    }
)
nuisance_diagnostics.to_csv(TABLE_DIR / f"{CASE_PREFIX}_nuisance_diagnostics.csv", index=False)
display(nuisance_diagnostics)
diagnostic value
0 propensity_auc 0.7218
1 propensity_log_loss 0.6137
2 propensity_min 0.1216
3 propensity_max 0.9082
4 outcome_observed_arm_r2 0.5002
5 outcome_observed_arm_mae 0.6858

The treatment model has signal, confirming that treatment was targeted. The outcome model is also informative enough to proceed, though it remains an approximation.

Overlap Buckets On The Held-Out Sample

Bucketed propensities reveal where support is strong or weak. The later policy section will use these scores to create a conservative support-aware targeting rule.

overlap_frame = test_df[["user_id", "onboarding_bundle", "eight_week_value", "true_cate"]].copy()
overlap_frame["estimated_propensity"] = propensity_pred
overlap_frame["overlap_bucket"] = pd.cut(
    overlap_frame["estimated_propensity"],
    bins=[0.0, 0.10, 0.25, 0.50, 0.75, 0.90, 1.0],
    include_lowest=True,
)
overlap_summary = (
    overlap_frame.groupby("overlap_bucket", observed=True)
    .agg(
        users=("user_id", "size"),
        treatment_rate=("onboarding_bundle", "mean"),
        mean_true_cate=("true_cate", "mean"),
        mean_outcome=("eight_week_value", "mean"),
    )
    .reset_index()
)
overlap_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_overlap_bucket_summary.csv", index=False)
display(overlap_summary)
overlap_bucket users treatment_rate mean_true_cate mean_outcome
0 (0.1, 0.25] 98 0.2347 -0.2653 0.3108
1 (0.25, 0.5] 647 0.3787 -0.0381 0.9534
2 (0.5, 0.75] 724 0.6105 0.3050 1.5611
3 (0.75, 0.9] 179 0.8659 0.7032 2.1917
4 (0.9, 1.0] 2 1.0000 1.4372 3.5733

Most users sit away from the most extreme propensity regions. That is encouraging, but the edge buckets are still useful for cautious reporting.

Estimator Strategy

The case study uses three complementary EconML estimators. The goal is not to crown a universal winner; it is to test whether the recommendation is stable across reasonable modeling choices.

estimator_setup = pd.DataFrame(
    [
        ("LinearDML", "Transparent structured heterogeneity with polynomial features.", "Can miss sharp nonlinearities.", "Readable baseline."),
        ("CausalForestDML", "Flexible nonlinear heterogeneity.", "Can be noisier and slower.", "Matches the nonlinear teaching setup."),
        ("DRLearner", "Doubly robust pseudo-outcome strategy.", "Can inherit variance from propensity estimates.", "Useful for robust targeting comparison."),
    ],
    columns=["estimator", "main_strength", "main_risk", "why_used_here"],
)
estimator_setup.to_csv(TABLE_DIR / f"{CASE_PREFIX}_estimator_setup.csv", index=False)
display(estimator_setup)
estimator main_strength main_risk why_used_here
0 LinearDML Transparent structured heterogeneity with poly... Can miss sharp nonlinearities. Readable baseline.
1 CausalForestDML Flexible nonlinear heterogeneity. Can be noisier and slower. Matches the nonlinear teaching setup.
2 DRLearner Doubly robust pseudo-outcome strategy. Can inherit variance from propensity estimates. Useful for robust targeting comparison.

The table names the modeling tradeoff before results are known. That keeps the comparison from becoming pure scoreboard chasing.

Fit EconML Estimators

This cell fits all estimators on the same training matrices and records runtime. The treatment is binary, so the DML estimators are configured for discrete treatment where needed.

estimators = {
    "LinearDML": LinearDML(
        model_y=make_rf_regressor(101),
        model_t=make_rf_classifier(102),
        discrete_treatment=True,
        featurizer=PolynomialFeatures(degree=2, include_bias=False),
        cv=3,
        random_state=RANDOM_STATE,
    ),
    "CausalForestDML": CausalForestDML(
        model_y=make_rf_regressor(201),
        model_t=make_rf_classifier(202),
        discrete_treatment=True,
        n_estimators=160,
        min_samples_leaf=18,
        max_samples=0.45,
        cv=3,
        random_state=RANDOM_STATE,
    ),
    "DRLearner": DRLearner(
        model_propensity=make_rf_classifier(301),
        model_regression=make_rf_regressor(302),
        model_final=make_rf_regressor(303),
        min_propensity=0.02,
        cv=3,
        random_state=RANDOM_STATE,
    ),
}

fit_rows = []
for estimator_name, estimator in estimators.items():
    start = time.perf_counter()
    estimator.fit(Y_train, T_train, X=X_train, W=W_train)
    fit_rows.append({"estimator": estimator_name, "fit_seconds": time.perf_counter() - start})

fit_runtime = pd.DataFrame(fit_rows).sort_values("fit_seconds")
fit_runtime.to_csv(TABLE_DIR / f"{CASE_PREFIX}_fit_runtime.csv", index=False)
display(fit_runtime)
estimator fit_seconds
0 LinearDML 3.4164
1 CausalForestDML 3.6490
2 DRLearner 3.9710

Runtime is part of practical model selection. A small accuracy improvement may not be worth much if it makes the workflow harder to refresh or explain.

Predict Held-Out CATEs

Each estimator now predicts a treatment effect for each held-out user. The true CATE is attached only for evaluation.

effect_predictions = test_df[["user_id", "onboarding_bundle", "eight_week_value", "true_cate", "diagnostic_propensity"]].copy()
prediction_rows = []
for estimator_name, estimator in estimators.items():
    start = time.perf_counter()
    effect_predictions[estimator_name] = estimator.effect(X_test)
    prediction_rows.append({"estimator": estimator_name, "predict_seconds": time.perf_counter() - start})

prediction_runtime = pd.DataFrame(prediction_rows).sort_values("predict_seconds")
effect_predictions.to_csv(TABLE_DIR / f"{CASE_PREFIX}_heldout_effect_predictions.csv", index=False)
prediction_runtime.to_csv(TABLE_DIR / f"{CASE_PREFIX}_prediction_runtime.csv", index=False)
display(effect_predictions.head())
display(prediction_runtime)
user_id onboarding_bundle eight_week_value true_cate diagnostic_propensity LinearDML CausalForestDML DRLearner
0 5064 1 2.0690 0.1123 0.5733 0.1962 0.0582 0.2281
1 1117 1 1.0561 0.1787 0.6108 0.0079 0.1102 0.0641
2 3396 0 1.8315 -0.2375 0.2277 -0.0012 0.0042 -0.1578
3 1511 1 1.4275 0.5047 0.7575 0.7616 0.3912 0.6287
4 5101 1 1.4083 -0.4018 0.7290 -0.1236 0.2198 0.3812
estimator predict_seconds
0 LinearDML 0.0030
1 CausalForestDML 0.0695
2 DRLearner 0.0827

The prediction table is the bridge from estimation to action. Once every held-out user has an estimated incremental effect, we can evaluate accuracy, ranking, segments, and policy value.

Held-Out Benchmark Metrics

The benchmark compares estimators using average-effect bias, individual-effect error, correlation with the true effect ranking, top-segment value, and treatment share above cost.

def safe_corr(left, right):
    if np.std(left) == 0 or np.std(right) == 0:
        return np.nan
    return float(np.corrcoef(left, right)[0, 1])

benchmark_rows = []
true_ate_test = true_tau_test.mean()
for estimator_name in estimators:
    estimated = effect_predictions[estimator_name].to_numpy()
    top_mask = estimated >= np.quantile(estimated, 0.80)
    benchmark_rows.append(
        {
            "estimator": estimator_name,
            "true_ate_test": true_ate_test,
            "estimated_ate": estimated.mean(),
            "ate_bias": estimated.mean() - true_ate_test,
            "abs_ate_bias": abs(estimated.mean() - true_ate_test),
            "cate_rmse": mean_squared_error(true_tau_test, estimated) ** 0.5,
            "cate_mae": mean_absolute_error(true_tau_test, estimated),
            "cate_correlation": safe_corr(true_tau_test, estimated),
            "top_20_true_cate_mean": true_tau_test[top_mask].mean(),
            "top_20_estimated_cate_mean": estimated[top_mask].mean(),
            "share_estimated_above_cost": np.mean(estimated > TREATMENT_COST),
        }
    )
benchmark_metrics = pd.DataFrame(benchmark_rows).merge(fit_runtime, on="estimator").merge(prediction_runtime, on="estimator")
benchmark_metrics = benchmark_metrics.sort_values("cate_rmse")
benchmark_metrics.to_csv(TABLE_DIR / f"{CASE_PREFIX}_benchmark_metrics.csv", index=False)
display(benchmark_metrics)
estimator true_ate_test estimated_ate ate_bias abs_ate_bias cate_rmse cate_mae cate_correlation top_20_true_cate_mean top_20_estimated_cate_mean share_estimated_above_cost fit_seconds predict_seconds
0 LinearDML 0.1812 0.2309 0.0497 0.0497 0.2719 0.2085 0.8948 0.7908 1.0499 0.5394 3.4164 0.0030
2 DRLearner 0.1812 0.2099 0.0287 0.0287 0.2720 0.2129 0.8458 0.7515 0.9236 0.5188 3.9710 0.0827
1 CausalForestDML 0.1812 0.2147 0.0336 0.0336 0.3168 0.2404 0.8173 0.7613 0.5643 0.5558 3.6490 0.0695

The best estimator depends on the use case. For targeting, ranking quality and top-segment value can matter more than a tiny difference in average-effect bias.
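
Ranking quality can be scored directly with a rank correlation. The short sketch below adds a Spearman correlation between estimated and true effects for each estimator, complementing the Pearson correlation already in the benchmark:

from scipy.stats import spearmanr

rank_rows = [
    {"estimator": name, "spearman_rank_corr": spearmanr(true_tau_test, effect_predictions[name])[0]}
    for name in estimators
]
display(pd.DataFrame(rank_rows).sort_values("spearman_rank_corr", ascending=False))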

Accuracy And Bias Plot

The next figure compares CATE RMSE with absolute ATE bias. This catches models that estimate the average well but miss heterogeneous effects.

metric_plot = benchmark_metrics.melt(
    id_vars="estimator",
    value_vars=["cate_rmse", "abs_ate_bias"],
    var_name="metric",
    value_name="value",
)
metric_plot["metric"] = metric_plot["metric"].map({"cate_rmse": "CATE RMSE", "abs_ate_bias": "Absolute ATE Bias"})

fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=metric_plot, x="value", y="estimator", hue="metric", ax=ax)
ax.set_title("Held-Out Accuracy And Average-Effect Bias")
ax.set_xlabel("Error")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_accuracy_and_bias.png", dpi=160, bbox_inches="tight")
plt.show()

This plot makes the accuracy tradeoff visible. In real work, true CATE would be unavailable, so analogous validation would rely on experiments, temporal checks, or sensitivity analysis.
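
One truth-free option is a proxy score. The sketch below assumes the RScorer interface in econml.score, which compares fitted CATE models using a residual-on-residual loss on held-out data; treat it as an illustration rather than a full validation plan:

from econml.score import RScorer

scorer = RScorer(
    model_y=make_rf_regressor(401),
    model_t=make_rf_classifier(402),
    discrete_treatment=True,
    cv=3,
    random_state=RANDOM_STATE,
)
scorer.fit(Y_test, T_test, X=X_test, W=W_test)
for name, fitted in estimators.items():
    print(f"{name} rscore: {scorer.score(fitted):.4f}")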

CATE Recovery Scatterplots

Scatterplots show whether estimated effects recover the shape of the true effect surface. Points near the diagonal are closer to the known truth.

scatter_data = effect_predictions.melt(
    id_vars=["user_id", "true_cate"],
    value_vars=list(estimators.keys()),
    var_name="estimator",
    value_name="estimated_cate",
)
min_value = min(scatter_data["true_cate"].min(), scatter_data["estimated_cate"].min())
max_value = max(scatter_data["true_cate"].max(), scatter_data["estimated_cate"].max())

g = sns.FacetGrid(scatter_data, col="estimator", col_wrap=3, height=3.6, sharex=True, sharey=True)
g.map_dataframe(sns.scatterplot, x="true_cate", y="estimated_cate", alpha=0.28, s=18, color="#2563eb")
for ax in g.axes.flat:
    ax.plot([min_value, max_value], [min_value, max_value], color="#dc2626", linestyle="--", linewidth=1)
    ax.set_xlabel("True CATE")
    ax.set_ylabel("Estimated CATE")
g.fig.suptitle("Held-Out CATE Recovery", y=1.02)
plt.tight_layout()
g.fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_cate_recovery_scatterplots.png", dpi=160, bbox_inches="tight")
plt.show()

The scatterplots show whether estimates are simply compressed toward the mean or whether they separate high- and low-benefit users.

Decile Calibration

Calibration asks whether users placed into higher estimated-effect buckets also have higher true effects. This is a practical targeting diagnostic.

calibration_parts = []
for estimator_name in estimators:
    temp = effect_predictions[["user_id", "true_cate", estimator_name]].copy()
    temp["estimated_cate"] = temp[estimator_name]
    temp["estimated_decile"] = pd.qcut(temp["estimated_cate"], q=10, labels=False, duplicates="drop") + 1
    summary = (
        temp.groupby("estimated_decile", observed=True)
        .agg(
            users=("user_id", "size"),
            mean_estimated_cate=("estimated_cate", "mean"),
            mean_true_cate=("true_cate", "mean"),
        )
        .reset_index()
    )
    summary["estimator"] = estimator_name
    calibration_parts.append(summary)
calibration_table = pd.concat(calibration_parts, ignore_index=True)
calibration_table.to_csv(TABLE_DIR / f"{CASE_PREFIX}_decile_calibration.csv", index=False)
display(calibration_table.head(15))
estimated_decile users mean_estimated_cate mean_true_cate estimator
0 1 165 -0.8145 -0.5506 LinearDML
1 2 165 -0.3783 -0.2866 LinearDML
2 3 165 -0.1381 -0.1077 LinearDML
3 4 165 0.0292 0.0188 LinearDML
4 5 165 0.1672 0.1223 LinearDML
5 6 165 0.2903 0.2248 LinearDML
6 7 165 0.4427 0.3436 LinearDML
7 8 165 0.6106 0.4654 LinearDML
8 9 165 0.8049 0.6159 LinearDML
9 10 165 1.2949 0.9657 LinearDML
10 1 165 -0.1988 -0.4769 CausalForestDML
11 2 165 -0.0617 -0.2107 CausalForestDML
12 3 165 0.0309 -0.0988 CausalForestDML
13 4 165 0.1072 0.0241 CausalForestDML
14 5 165 0.1828 0.1088 CausalForestDML

A useful targeting model should show rising true effects as estimated deciles increase. If that pattern fails, individual targeting is hard to justify.
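
That pattern can be checked numerically. The sketch below scores, for each estimator, the share of decile-to-decile steps in which the true mean CATE rises:

# Monotonicity check on the calibration table built above.
for name, group in calibration_table.groupby("estimator"):
    ordered = group.sort_values("estimated_decile")["mean_true_cate"]
    rising = (ordered.diff().dropna() > 0).mean()
    print(f"{name}: {rising:.0%} of decile steps increase in true CATE")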

Calibration Plot

The plot compares estimated and true average CATE by estimated-effect decile for each estimator.

calibration_plot = calibration_table.melt(
    id_vars=["estimator", "estimated_decile", "users"],
    value_vars=["mean_estimated_cate", "mean_true_cate"],
    var_name="series",
    value_name="cate",
)
calibration_plot["series"] = calibration_plot["series"].map({"mean_estimated_cate": "Estimated", "mean_true_cate": "True"})

g = sns.relplot(
    data=calibration_plot,
    x="estimated_decile",
    y="cate",
    hue="series",
    col="estimator",
    kind="line",
    marker="o",
    col_wrap=3,
    height=3.4,
    facet_kws={"sharey": True},
)
g.set_axis_labels("Estimated CATE Decile", "Average CATE")
g.fig.suptitle("CATE Decile Calibration", y=1.03)
plt.tight_layout()
g.fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_decile_calibration.png", dpi=160, bbox_inches="tight")
plt.show()

This view is decision-oriented. A model can be noisy at the individual level yet still useful if it ranks high-benefit users reliably.

Segment-Level Checks

Segment checks reveal whether estimator behavior is stable across meaningful groups. This catches cases where a model looks good overall but fails in a segment that matters.

segment_definitions = {
    "high_onboarding_gap": test_df["onboarding_gap"] >= test_df["onboarding_gap"].median(),
    "high_prior_sessions": test_df["prior_sessions"] >= test_df["prior_sessions"].median(),
    "high_search_difficulty": test_df["search_difficulty"].fillna(train_medians["search_difficulty"]) >= test_df["search_difficulty"].fillna(train_medians["search_difficulty"]).median(),
    "high_price_sensitivity": test_df["price_sensitivity"] >= test_df["price_sensitivity"].median(),
    "high_motivation_segment": test_df["high_motivation_segment"] == 1,
    "organic_referral": test_df["referral_organic"] == 1,
    "mobile_primary": test_df["mobile_primary"] == 1,
}

segment_rows = []
for segment_name, mask in segment_definitions.items():
    mask_array = mask.to_numpy()
    for estimator_name in estimators:
        estimated = effect_predictions[estimator_name].to_numpy()
        segment_rows.append(
            {
                "segment": segment_name,
                "estimator": estimator_name,
                "users": int(mask_array.sum()),
                "true_cate_mean": true_tau_test[mask_array].mean(),
                "estimated_cate_mean": estimated[mask_array].mean(),
                "segment_bias": estimated[mask_array].mean() - true_tau_test[mask_array].mean(),
                "share_estimated_above_cost": np.mean(estimated[mask_array] > TREATMENT_COST),
            }
        )
segment_benchmark = pd.DataFrame(segment_rows)
segment_benchmark.to_csv(TABLE_DIR / f"{CASE_PREFIX}_segment_benchmark.csv", index=False)
display(segment_benchmark.head(15))
segment estimator users true_cate_mean estimated_cate_mean segment_bias share_estimated_above_cost
0 high_onboarding_gap LinearDML 825 0.4101 0.5078 0.0977 0.7527
1 high_onboarding_gap CausalForestDML 825 0.4101 0.3854 -0.0247 0.8800
2 high_onboarding_gap DRLearner 825 0.4101 0.4808 0.0707 0.7564
3 high_prior_sessions LinearDML 825 0.2634 0.2984 0.0350 0.5927
4 high_prior_sessions CausalForestDML 825 0.2634 0.2665 0.0031 0.6461
5 high_prior_sessions DRLearner 825 0.2634 0.2666 0.0031 0.5697
6 high_search_difficulty LinearDML 825 0.2798 0.3400 0.0602 0.6085
7 high_search_difficulty CausalForestDML 825 0.2798 0.2913 0.0115 0.6812
8 high_search_difficulty DRLearner 825 0.2798 0.3170 0.0372 0.6048
9 high_price_sensitivity LinearDML 825 0.1520 0.1980 0.0460 0.5139
10 high_price_sensitivity CausalForestDML 825 0.1520 0.2295 0.0775 0.5758
11 high_price_sensitivity DRLearner 825 0.1520 0.2024 0.0504 0.4994
12 high_motivation_segment LinearDML 852 0.3199 0.4238 0.1039 0.6878
13 high_motivation_segment CausalForestDML 852 0.3199 0.2690 -0.0509 0.6455
14 high_motivation_segment DRLearner 852 0.3199 0.3154 -0.0045 0.6092

Segment summaries show where estimated effects are systematically high or low. In real analysis, this would be a stability check rather than a truth-scoring table.
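
One stability check that needs no ground truth is cross-estimator agreement: if reasonable estimators rank users very differently, any single targeting rule deserves extra scrutiny. A minimal version:

# Pairwise correlation of held-out CATE predictions across estimators.
agreement = effect_predictions[list(estimators.keys())].corr()
display(agreement)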

Segment Bias Heatmap

The heatmap makes segment-level average bias easy to scan across estimators.

segment_heatmap = segment_benchmark.pivot(index="segment", columns="estimator", values="segment_bias")
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.heatmap(segment_heatmap, annot=True, fmt=".3f", cmap="vlag", center=0, linewidths=0.5, ax=ax)
ax.set_title("Segment-Level Average CATE Bias")
ax.set_xlabel("")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_segment_bias_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

The heatmap separates global accuracy from segment reliability. A strong recommendation should mention both.

Cost-Aware Policy Targeting

The simplest policy treats a user when estimated CATE is greater than treatment cost. We compare that with two baselines: treat nobody and treat everybody.

def summarize_policy(policy_name, action, estimated_effect=None):
    action = action.astype(int)
    true_net = (true_tau_test - TREATMENT_COST) * action
    return {
        "policy": policy_name,
        "treatment_rate": action.mean(),
        "users_treated": int(action.sum()),
        "true_net_value_per_user": true_net.mean(),
        "true_gross_effect_per_user": (true_tau_test * action).mean(),
        "precision_positive_net": np.mean(true_tau_test[action == 1] > TREATMENT_COST) if action.sum() > 0 else np.nan,
        "mean_true_cate_among_treated": true_tau_test[action == 1].mean() if action.sum() > 0 else np.nan,
        "mean_estimated_cate_among_treated": estimated_effect[action == 1].mean() if estimated_effect is not None and action.sum() > 0 else np.nan,
    }

policy_rows = [
    summarize_policy("Treat none", np.zeros(len(test_df), dtype=bool)),
    summarize_policy("Treat all", np.ones(len(test_df), dtype=bool)),
]
for estimator_name in estimators:
    estimated = effect_predictions[estimator_name].to_numpy()
    policy_rows.append(summarize_policy(f"{estimator_name}: effect > cost", estimated > TREATMENT_COST, estimated))

policy_summary = pd.DataFrame(policy_rows).sort_values("true_net_value_per_user", ascending=False)
policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_policy_summary.csv", index=False)
display(policy_summary)
policy treatment_rate users_treated true_net_value_per_user true_gross_effect_per_user precision_positive_net mean_true_cate_among_treated mean_estimated_cate_among_treated
2 LinearDML: effect > cost 0.5394 890 0.1709 0.2679 0.8292 0.4968 0.6535
4 DRLearner: effect > cost 0.5188 856 0.1642 0.2575 0.8236 0.4964 0.5824
3 CausalForestDML: effect > cost 0.5558 917 0.1540 0.2540 0.7819 0.4571 0.3954
1 Treat all 1.0000 1650 0.0012 0.1812 0.4988 0.1812 NaN
0 Treat none 0.0000 0 0.0000 0.0000 NaN NaN NaN

The policy table is the handoff from estimation to action. A good CATE model should improve net value over simple baselines after treatment cost is included.
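
A compact way to state that improvement is the per-user gain of the best model-based policy over the best blanket policy, computed from the table above:

blanket_policies = ["Treat none", "Treat all"]
blanket_best = policy_summary.loc[policy_summary["policy"].isin(blanket_policies), "true_net_value_per_user"].max()
model_best = policy_summary.loc[~policy_summary["policy"].isin(blanket_policies), "true_net_value_per_user"].max()
print(f"Net value gain from model-based targeting: {model_best - blanket_best:.4f} per user")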

Policy Value Plot

The y-axis lists policies and the x-axis shows true net value per held-out user after subtracting treatment cost.

fig, ax = plt.subplots(figsize=(10, 5))
plot_policy = policy_summary.sort_values("true_net_value_per_user")
sns.barplot(data=plot_policy, x="true_net_value_per_user", y="policy", color="#22c55e", ax=ax)
ax.axvline(0, color="#111827", linewidth=1)
ax.set_title("Held-Out Net Policy Value")
ax.set_xlabel("True Net Value Per User")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_policy_value.png", dpi=160, bbox_inches="tight")
plt.show()

This is the most decision-facing plot in the notebook. It converts estimated effects into a concrete recommendation and compares that recommendation with obvious alternatives.

Budget-Constrained Targeting

A threshold rule is not the only practical choice. Under a fixed budget, the team may instead treat only the users with the highest estimated effects. This cell evaluates several budget levels.

budget_levels = [0.10, 0.20, 0.30, 0.40, 0.50]
budget_rows = []
for estimator_name in estimators:
    estimated = effect_predictions[estimator_name].to_numpy()
    for budget in budget_levels:
        cutoff = np.quantile(estimated, 1 - budget)
        action = estimated >= cutoff
        budget_rows.append(
            {
                "estimator": estimator_name,
                "budget_share": budget,
                "users_treated": int(action.sum()),
                "true_net_value_per_user": ((true_tau_test - TREATMENT_COST) * action).mean(),
                "mean_true_cate_among_treated": true_tau_test[action].mean(),
                "precision_positive_net": np.mean(true_tau_test[action] > TREATMENT_COST),
            }
        )
budget_curve = pd.DataFrame(budget_rows)
budget_curve.to_csv(TABLE_DIR / f"{CASE_PREFIX}_budget_curve.csv", index=False)
display(budget_curve.head(10))
estimator budget_share users_treated true_net_value_per_user mean_true_cate_among_treated precision_positive_net
0 LinearDML 0.1000 165 0.0786 0.9657 1.0000
1 LinearDML 0.2000 330 0.1222 0.7908 0.9939
2 LinearDML 0.3000 495 0.1507 0.6823 0.9636
3 LinearDML 0.4000 660 0.1671 0.5977 0.9182
4 LinearDML 0.5000 825 0.1715 0.5231 0.8570
5 CausalForestDML 0.1000 165 0.0749 0.9294 1.0000
6 CausalForestDML 0.2000 330 0.1163 0.7613 0.9727
7 CausalForestDML 0.3000 495 0.1396 0.6453 0.9131
8 CausalForestDML 0.4000 660 0.1530 0.5624 0.8682
9 CausalForestDML 0.5000 825 0.1565 0.4930 0.8133

Budget curves show whether an estimator remains useful as the program expands from a small high-confidence group to a broader population.

Budget Curve Plot

The curve shows net value as the share of treated users increases. Strong policies gain most of their value at low budgets, where targeting concentrates on high-benefit users, and then flatten as lower-benefit users are added.

fig, ax = plt.subplots(figsize=(9, 5))
sns.lineplot(data=budget_curve, x="budget_share", y="true_net_value_per_user", hue="estimator", marker="o", ax=ax)
ax.axhline(0, color="#111827", linewidth=1)
ax.set_title("Budget-Constrained Targeting Value")
ax.set_xlabel("Share Of Users Treated")
ax.set_ylabel("True Net Value Per User")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_budget_curve.png", dpi=160, bbox_inches="tight")
plt.show()

This curve helps choose an operating point. It also avoids implying that a model supports broad rollout when its value is concentrated in a small top segment.
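
One way to formalize the operating-point choice is to compute the marginal value of each budget expansion step. Because net value is averaged over all held-out users, the change in value divided by the change in budget share approximates the mean of (true effect minus cost) among the users added at that step; once this turns negative, further expansion destroys value. A minimal sketch using the `budget_curve` table above:

# Marginal net value per additional share of users treated, within each
# estimator. The first budget level per estimator has no predecessor, so
# its marginal value is NaN.
marginal = budget_curve.sort_values(["estimator", "budget_share"]).copy()
marginal["marginal_value_per_share"] = (
    marginal.groupby("estimator")["true_net_value_per_user"].diff()
    / marginal.groupby("estimator")["budget_share"].diff()
)
display(marginal[["estimator", "budget_share", "true_net_value_per_user", "marginal_value_per_share"]])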

Support-Aware Policy Filter

A conservative policy can restrict model-based targeting to users whose diagnostic propensity score falls between 0.10 and 0.90. This avoids acting in regions where observational support is thin.

propensity_values = overlap_frame["estimated_propensity"].to_numpy()
support_mask = (propensity_values >= 0.10) & (propensity_values <= 0.90)
support_policy_rows = []
for estimator_name in estimators:
    estimated = effect_predictions[estimator_name].to_numpy()
    standard_action = estimated > TREATMENT_COST
    support_action = standard_action & support_mask
    support_policy_rows.append(
        {
            "estimator": estimator_name,
            "eligible_share": support_mask.mean(),
            "standard_treatment_rate": standard_action.mean(),
            "support_aware_treatment_rate": support_action.mean(),
            "standard_true_net_value": ((true_tau_test - TREATMENT_COST) * standard_action).mean(),
            "support_aware_true_net_value": ((true_tau_test - TREATMENT_COST) * support_action).mean(),
            "users_removed_by_support_filter": int((standard_action & ~support_mask).sum()),
        }
    )
support_policy_summary = pd.DataFrame(support_policy_rows).sort_values("support_aware_true_net_value", ascending=False)
support_policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_support_aware_policy_summary.csv", index=False)
display(support_policy_summary)
estimator eligible_share standard_treatment_rate support_aware_treatment_rate standard_true_net_value support_aware_true_net_value users_removed_by_support_filter
0 LinearDML 0.9988 0.5394 0.5382 0.1709 0.1693 2
2 DRLearner 0.9988 0.5188 0.5176 0.1642 0.1626 2
1 CausalForestDML 0.9988 0.5558 0.5545 0.1540 0.1525 2

The support-aware version is more conservative. It may sacrifice some value in this simulation, but it is often a healthier recommendation when overlap is imperfect.
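
The 0.10 and 0.90 bounds are conventions, not laws. A minimal sensitivity sketch, reusing `propensity_values` from the cell above and checking a single estimator (`LinearDML` is an illustrative choice) across a few illustrative bound pairs:

# Tighter bounds shrink the eligible population; the question is how much
# net value that extra caution costs in this simulation.
estimated_lineardml = effect_predictions["LinearDML"].to_numpy()
for low, high in [(0.05, 0.95), (0.10, 0.90), (0.20, 0.80)]:
    mask = (propensity_values >= low) & (propensity_values <= high)
    action = (estimated_lineardml > TREATMENT_COST) & mask
    net_value = ((true_tau_test - TREATMENT_COST) * action).mean()
    print(f"bounds=({low:.2f}, {high:.2f}) eligible={mask.mean():.3f} net_value={net_value:.4f}")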

Bootstrap Uncertainty For Policy Value

The next cell bootstraps the held-out rows and recomputes policy value. This gives a lightweight uncertainty summary for policy comparisons.

bootstrap_rng = np.random.default_rng(RANDOM_STATE + 99)
N_BOOTSTRAPS = 300
policy_actions = {
    "Treat none": np.zeros(len(test_df), dtype=bool),
    "Treat all": np.ones(len(test_df), dtype=bool),
}
for estimator_name in estimators:
    policy_actions[f"{estimator_name}: effect > cost"] = effect_predictions[estimator_name].to_numpy() > TREATMENT_COST

bootstrap_rows = []
for draw in range(N_BOOTSTRAPS):
    sample_index = bootstrap_rng.choice(len(test_df), size=len(test_df), replace=True)
    for policy_name, action in policy_actions.items():
        value = ((true_tau_test[sample_index] - TREATMENT_COST) * action[sample_index]).mean()
        bootstrap_rows.append({"draw": draw, "policy": policy_name, "true_net_value_per_user": value})

bootstrap_policy = pd.DataFrame(bootstrap_rows)
bootstrap_policy_summary = (
    bootstrap_policy.groupby("policy")
    .agg(
        mean_value=("true_net_value_per_user", "mean"),
        lower_95=("true_net_value_per_user", lambda values: np.quantile(values, 0.025)),
        upper_95=("true_net_value_per_user", lambda values: np.quantile(values, 0.975)),
    )
    .reset_index()
    .sort_values("mean_value", ascending=False)
)
bootstrap_policy.to_csv(TABLE_DIR / f"{CASE_PREFIX}_bootstrap_policy_values.csv", index=False)
bootstrap_policy_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_bootstrap_policy_summary.csv", index=False)
display(bootstrap_policy_summary)
policy mean_value lower_95 upper_95
2 LinearDML: effect > cost 0.1701 0.1560 0.1837
1 DRLearner: effect > cost 0.1635 0.1490 0.1773
0 CausalForestDML: effect > cost 0.1531 0.1386 0.1663
3 Treat all 0.0006 -0.0204 0.0228
4 Treat none 0.0000 0.0000 0.0000

The intervals show whether the apparent policy winner is clearly separated or only slightly ahead. That matters for honest reporting.

Bootstrap Policy Plot

This plot summarizes uncertainty around net policy value. Wider intervals mean the estimated policy value is more sensitive to which users are in the evaluation sample.

plot_bootstrap = bootstrap_policy_summary.sort_values("mean_value")
fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=plot_bootstrap["mean_value"],
    y=plot_bootstrap["policy"],
    xerr=[
        plot_bootstrap["mean_value"] - plot_bootstrap["lower_95"],
        plot_bootstrap["upper_95"] - plot_bootstrap["mean_value"],
    ],
    fmt="o",
    color="#2563eb",
    ecolor="#334155",
    capsize=4,
)
ax.axvline(0, color="#111827", linewidth=1)
ax.set_title("Bootstrap Uncertainty For Policy Value")
ax.set_xlabel("True Net Value Per User")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{CASE_PREFIX}_bootstrap_policy_intervals.png", dpi=160, bbox_inches="tight")
plt.show()

The final recommendation should consider both point value and uncertainty. If intervals overlap heavily, the safer conclusion is that policies are practically similar until stronger validation is available.
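
Because every policy is scored on the same bootstrap draws, pairwise policy differences can be computed directly rather than eyeballing interval overlap; paired differences are usually tighter than the marginal intervals suggest. A minimal sketch comparing the two leading policies, assuming the `bootstrap_policy` frame built above:

# Pivot to one column per policy, then take the per-draw difference.
wide = bootstrap_policy.pivot(index="draw", columns="policy", values="true_net_value_per_user")
diff = wide["LinearDML: effect > cost"] - wide["DRLearner: effect > cost"]
print(f"mean difference: {diff.mean():.4f}")
print(f"95% interval: [{np.quantile(diff, 0.025):.4f}, {np.quantile(diff, 0.975):.4f}]")
print(f"share of draws where the LinearDML policy wins: {(diff > 0).mean():.3f}")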

Report-Ready Summary Table

The final comparison table combines estimation accuracy, targeting value, and runtime. It is designed to be readable outside the notebook.

policy_rows_for_estimators = policy_summary[policy_summary["policy"].str.contains(": effect > cost", regex=False)].copy()
policy_rows_for_estimators["estimator"] = policy_rows_for_estimators["policy"].str.replace(": effect > cost", "", regex=False)
report_summary = (
    benchmark_metrics[["estimator", "estimated_ate", "ate_bias", "cate_rmse", "cate_correlation", "top_20_true_cate_mean", "fit_seconds"]]
    .merge(
        policy_rows_for_estimators[["estimator", "treatment_rate", "true_net_value_per_user", "precision_positive_net"]],
        on="estimator",
    )
    .sort_values("true_net_value_per_user", ascending=False)
)
report_summary.to_csv(TABLE_DIR / f"{CASE_PREFIX}_report_summary.csv", index=False)
display(report_summary)
estimator estimated_ate ate_bias cate_rmse cate_correlation top_20_true_cate_mean fit_seconds treatment_rate true_net_value_per_user precision_positive_net
0 LinearDML 0.2309 0.0497 0.2719 0.8948 0.7908 3.4164 0.5394 0.1709 0.8292
1 DRLearner 0.2099 0.0287 0.2720 0.8458 0.7515 3.9710 0.5188 0.1642 0.8236
2 CausalForestDML 0.2147 0.0336 0.3168 0.8173 0.7613 3.6490 0.5558 0.1540 0.7819

This table prevents the conclusion from relying on a single metric. A useful causal model must estimate effects, rank users, and support a decision that creates net value.
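
A quick way to check for single-metric dominance is to rank the estimators on each headline metric. A minimal sketch, assuming the `report_summary` frame above; lower rank is better, so error-style metrics rank ascending and value-style metrics descending:

# Rank estimators per metric; a column of all 1s would mean one estimator
# dominates on every axis.
ranked = report_summary.set_index("estimator")
rank_table = pd.DataFrame(
    {
        "abs_ate_bias_rank": ranked["ate_bias"].abs().rank(),
        "cate_rmse_rank": ranked["cate_rmse"].rank(),
        "cate_correlation_rank": ranked["cate_correlation"].rank(ascending=False),
        "net_value_rank": ranked["true_net_value_per_user"].rank(ascending=False),
    }
)
display(rank_table)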

Written Case Study Recommendation

The next cell writes a compact markdown summary from the notebook outputs. It names the decision, the strongest policy, the evidence used, and the caveats.

# policy_summary is already sorted by true net value, so the first row is the winner.
best_policy = policy_summary.iloc[0]
# Sort explicitly so the "lowest RMSE" claim holds regardless of upstream row order.
best_estimator_by_rmse = benchmark_metrics.sort_values("cate_rmse").iloc[0]

report_lines = [
    "# End-To-End EconML Case Study Summary",
    "",
    "## Decision",
    "",
    f"Estimate which users should receive a proactive onboarding bundle when treatment cost is `{TREATMENT_COST:.2f}` outcome units per treated user.",
    "",
    "## Main Result",
    "",
    f"The strongest held-out policy by true net value is **{best_policy['policy']}**.",
    "",
    f"- Treatment rate: `{best_policy['treatment_rate']:.1%}`",
    f"- True net value per user: `{best_policy['true_net_value_per_user']:.4f}`",
    f"- Precision among treated users: `{best_policy['precision_positive_net']:.1%}`",
    "",
    f"The estimator with the lowest held-out CATE RMSE is **{best_estimator_by_rmse['estimator']}** with RMSE `{best_estimator_by_rmse['cate_rmse']:.4f}`.",
    "",
    "## Evidence Used",
    "",
    "The recommendation uses covariate balance checks, overlap diagnostics, nuisance-model diagnostics, held-out CATE recovery, segment bias checks, policy value comparisons, budget curves, and bootstrap policy intervals.",
    "",
    "## Caveats",
    "",
    "This is a teaching dataset with known ground truth. In real observational work, hidden true CATE would not be available, so the workflow would need domain review of confounders, sensitivity analysis, temporal validation, and ideally an experiment for policy confirmation.",
]
report_text = "\n".join(report_lines)
report_path = REPORT_DIR / f"{CASE_PREFIX}_case_study_summary.md"
report_path.write_text(report_text)
display(Markdown(report_text))
print(f"Saved report to: {report_path}")

End-To-End EconML Case Study Summary

Decision

Estimate which users should receive a proactive onboarding bundle when treatment cost is 0.18 outcome units per treated user.

Main Result

The strongest held-out policy by true net value is LinearDML: effect > cost.

  • Treatment rate: 53.9%
  • True net value per user: 0.1709
  • Precision among treated users: 82.9%

The estimator with the lowest held-out CATE RMSE is LinearDML with RMSE 0.2719.

Evidence Used

The recommendation uses covariate balance checks, overlap diagnostics, nuisance-model diagnostics, held-out CATE recovery, segment bias checks, policy value comparisons, budget curves, and bootstrap policy intervals.

Caveats

This is a teaching dataset with known ground truth. In real observational work, hidden true CATE would not be available, so the workflow would need domain review of confounders, sensitivity analysis, temporal validation, and ideally an experiment for policy confirmation.

Saved report to: /home/apex/Documents/ranking_sys/notebooks/tutorials/econml/outputs/reports/14_case_study_summary.md

The written summary is short because the notebook already contains the evidence. The recommendation is useful precisely because it pairs a policy result with diagnostics and caveats.

Final Checklist

The final checklist turns the notebook into a reusable workflow. In real analyses, missing checklist items should be treated as reasons for caution.

case_study_checklist = pd.DataFrame(
    [
        ("Decision defined", "Treatment, outcome, cost, and target population are explicit."),
        ("Pre-treatment variables separated", "Effect modifiers and controls are assigned before fitting."),
        ("Missingness handled", "Imputation is learned from training data only."),
        ("Confounding diagnosed", "Raw gaps, balance, and propensity overlap are reviewed."),
        ("Nuisance models checked", "Treatment and outcome models have held-out diagnostics."),
        ("Multiple estimators compared", "Results are not tied to one estimator by default."),
        ("Policy value evaluated", "CATE estimates are translated into net decision value."),
        ("Support considered", "Weak-overlap regions are handled cautiously."),
        ("Uncertainty shown", "Bootstrap intervals or comparable uncertainty summaries are included."),
        ("Caveats documented", "Assumptions and validation needs are written plainly."),
    ],
    columns=["check", "what_good_reporting_should_include"],
)
case_study_checklist.to_csv(TABLE_DIR / f"{CASE_PREFIX}_case_study_checklist.csv", index=False)
display(case_study_checklist)
check                                 what_good_reporting_should_include
0  Decision defined                   Treatment, outcome, cost, and target population are explicit.
1  Pre-treatment variables separated  Effect modifiers and controls are assigned before fitting.
2  Missingness handled                Imputation is learned from training data only.
3  Confounding diagnosed              Raw gaps, balance, and propensity overlap are reviewed.
4  Nuisance models checked            Treatment and outcome models have held-out diagnostics.
5  Multiple estimators compared       Results are not tied to one estimator by default.
6  Policy value evaluated             CATE estimates are translated into net decision value.
7  Support considered                 Weak-overlap regions are handled cautiously.
8  Uncertainty shown                  Bootstrap intervals or comparable uncertainty summaries are included.
9  Caveats documented                 Assumptions and validation needs are written plainly.

The checklist closes the loop. A strong EconML workflow is not only estimator fitting; it is design, diagnostics, estimation, decision translation, and careful reporting.

What Comes Next

The final notebook in this EconML tutorial sequence covers common pitfalls, debugging, and reporting. This case study used a clean teaching setup; real applied analyses are more likely to fail through leakage, weak overlap, invalid controls, unstable nuisance models, or overconfident CATE storytelling.