EconML Tutorial 01: CATE Foundations And Potential Outcomes

This notebook builds the conceptual foundation for the EconML tutorial series. Before fitting DML, causal forests, DR learners, or meta-learners, we need a precise answer to a simpler question:

What treatment effect are we trying to estimate?

EconML is especially useful when the effect is not the same for everyone. That means we need precise language for which effect we are estimating, for which population, and under which assumptions.

The dataset is synthetic, so both potential outcomes are known inside the notebook. That would not be true in real data, but it gives us a clean teaching sandbox.

Learning Goals

By the end of this notebook, you should be able to:

  • Define potential outcomes Y(0) and Y(1).
  • Explain why individual treatment effects are not directly observed in real data.
  • Distinguish ATE, ATT, ATC, CATE, and ITE-style language.
  • Diagnose confounding and overlap before estimating treatment effects.
  • Show why one ATE can hide meaningful segment-level differences.
  • Use an oracle synthetic dataset to connect CATE to treatment targeting.
  • Understand why later EconML notebooks need nuisance models and effect modifiers.

Why Foundations Matter For EconML

EconML estimators can produce a treatment-effect estimate for every row. That is powerful, but it is easy to misuse if the estimand is unclear.

A row-level CATE estimate is not magic personalization. It is an estimate of an expected contrast under assumptions:

E[Y(1) - Y(0) | X = x]

The conditioning features X define the heterogeneity we want to learn. The controls W help adjust for confounding. Later notebooks will fit estimators; this notebook focuses on what those estimators are trying to recover.
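As a minimal self-contained sketch of what the conditional contrast means, the toy arrays below (hypothetical, not the notebook's dataset) use a single binary effect modifier `x` and compute the mean of Y(1) - Y(0) within each stratum:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.5, size=10_000)       # one binary effect modifier
y0 = rng.normal(0.0, 1.0, size=10_000)      # potential outcome under control
y1 = y0 + 0.2 + 0.6 * x                     # treatment adds 0.2, plus 0.6 more when x = 1

# CATE at each value of x is the mean contrast within that stratum.
contrast = y1 - y0
print(round(contrast[x == 0].mean(), 3), round(contrast[x == 1].mean(), 3))  # -> 0.2 0.8
```

Because the contrast here has no noise, the stratum means recover the simulated effects exactly; real data never allows this direct computation, which is the point of the rest of the notebook.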

Setup

This cell imports the libraries, creates output folders, checks the EconML version, and sets plotting defaults. We keep this notebook estimator-light, but the import check confirms the tutorial environment is still ready for the later EconML notebooks.

from pathlib import Path
import os
import warnings
import importlib.metadata as importlib_metadata

# Keep Matplotlib cache files in a writable location during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib-ranking-sys")

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*X does not have valid feature names.*")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

try:
    import econml
    ECONML_AVAILABLE = True
    ECONML_VERSION = getattr(econml, "__version__", "unknown")
except Exception as exc:
    ECONML_AVAILABLE = False
    ECONML_VERSION = f"import failed: {type(exc).__name__}: {exc}"

RANDOM_SEED = 2026
rng = np.random.default_rng(RANDOM_SEED)

OUTPUT_DIR = Path("outputs")
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

print(f"EconML available: {ECONML_AVAILABLE}")
print(f"EconML version: {ECONML_VERSION}")
print(f"Figures will be saved to: {FIGURE_DIR.resolve()}")
print(f"Tables will be saved to: {TABLE_DIR.resolve()}")
EconML available: True
EconML version: 0.16.0
Figures will be saved to: /home/apex/Documents/ranking_sys/notebooks/tutorials/econml/outputs/figures
Tables will be saved to: /home/apex/Documents/ranking_sys/notebooks/tutorials/econml/outputs/tables

The setup confirms that the environment is ready. Every saved artifact from this notebook uses a 01_ prefix.

Estimand Vocabulary

The next table defines the core treatment-effect quantities used throughout the EconML series. The names look similar, but they answer different questions.

estimand_vocabulary = pd.DataFrame(
    [
        {
            "term": "Potential outcome Y(0)",
            "plain meaning": "Outcome the unit would have under no treatment.",
            "conditioning": "unit-level hypothetical",
            "observed in real data": "only if T = 0",
        },
        {
            "term": "Potential outcome Y(1)",
            "plain meaning": "Outcome the unit would have under treatment.",
            "conditioning": "unit-level hypothetical",
            "observed in real data": "only if T = 1",
        },
        {
            "term": "ITE-style contrast",
            "plain meaning": "Y(1) - Y(0) for a unit.",
            "conditioning": "single unit",
            "observed in real data": "no, because one potential outcome is missing",
        },
        {
            "term": "ATE",
            "plain meaning": "Average treatment effect in the whole population.",
            "conditioning": "none or full population",
            "observed in real data": "estimated under assumptions",
        },
        {
            "term": "ATT",
            "plain meaning": "Average treatment effect among treated units.",
            "conditioning": "T = 1 population",
            "observed in real data": "estimated under assumptions",
        },
        {
            "term": "ATC",
            "plain meaning": "Average treatment effect among control units.",
            "conditioning": "T = 0 population",
            "observed in real data": "estimated under assumptions",
        },
        {
            "term": "CATE",
            "plain meaning": "Average treatment effect for units with features X = x.",
            "conditioning": "effect modifiers X",
            "observed in real data": "estimated under assumptions",
        },
    ]
)

estimand_vocabulary.to_csv(TABLE_DIR / "01_estimand_vocabulary.csv", index=False)
display(estimand_vocabulary)
term plain meaning conditioning observed in real data
0 Potential outcome Y(0) Outcome the unit would have under no treatment. unit-level hypothetical only if T = 0
1 Potential outcome Y(1) Outcome the unit would have under treatment. unit-level hypothetical only if T = 1
2 ITE-style contrast Y(1) - Y(0) for a unit. single unit no, because one potential outcome is missing
3 ATE Average treatment effect in the whole population. none or full population estimated under assumptions
4 ATT Average treatment effect among treated units. T = 1 population estimated under assumptions
5 ATC Average treatment effect among control units. T = 0 population estimated under assumptions
6 CATE Average treatment effect for units with featur... effect modifiers X estimated under assumptions

The key EconML target is usually CATE. The average effect still matters, but heterogeneity is the reason to reach for a specialized library.

Identification Assumptions

Potential-outcomes notation does not identify effects by itself. We need assumptions that connect observed data to the missing counterfactual outcomes.

assumption_table = pd.DataFrame(
    [
        {
            "assumption": "Consistency",
            "plain meaning": "The observed outcome equals the potential outcome under the treatment actually received.",
            "why it matters": "Lets us write Y = T*Y(1) + (1-T)*Y(0).",
        },
        {
            "assumption": "No interference",
            "plain meaning": "One unit's treatment does not change another unit's potential outcomes.",
            "why it matters": "Lets each row be treated as its own treatment-effect problem.",
        },
        {
            "assumption": "Ignorability / unconfoundedness",
            "plain meaning": "After observed covariates, treatment assignment is as-if random.",
            "why it matters": "Lets observed controls stand in for the missing counterfactual assignment process.",
        },
        {
            "assumption": "Overlap / positivity",
            "plain meaning": "Every relevant covariate region has some chance of treatment and control.",
            "why it matters": "Lets us compare like with like instead of extrapolating everywhere.",
        },
        {
            "assumption": "Correct feature timing",
            "plain meaning": "X and W are measured before treatment and are not outcome leakage.",
            "why it matters": "Prevents post-treatment or future variables from contaminating CATE estimates.",
        },
    ]
)

assumption_table.to_csv(TABLE_DIR / "01_identification_assumptions.csv", index=False)
display(assumption_table)
assumption plain meaning why it matters
0 Consistency The observed outcome equals the potential outc... Lets us write Y = T*Y(1) + (1-T)*Y(0).
1 No interference One unit's treatment does not change another u... Lets each row be treated as its own treatment-...
2 Ignorability / unconfoundedness After observed covariates, treatment assignmen... Lets observed controls stand in for the missin...
3 Overlap / positivity Every relevant covariate region has some chanc... Lets us compare like with like instead of extr...
4 Correct feature timing X and W are measured before treatment and are ... Prevents post-treatment or future variables fr...

These assumptions are not EconML-specific; they are causal inference assumptions. EconML gives estimators, not automatic identification guarantees.
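The consistency row in the table can be checked mechanically. This standalone sketch (toy arrays, not the notebook's dataset) shows that the algebraic form Y = T*Y(1) + (1-T)*Y(0) is identical to switching on the received arm:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
y0 = rng.normal(size=n)
y1 = y0 + 0.5
t = rng.binomial(1, 0.5, size=n)

# Consistency lets the factual outcome be written as a switch between
# the two potential outcomes; both forms below agree element by element.
y_obs = t * y1 + (1 - t) * y0
assert np.array_equal(y_obs, np.where(t == 1, y1, y0))
print("consistency identity holds on", n, "rows")
```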

Simulate Potential Outcomes

We now create a synthetic dataset where both Y0 and Y1 are known. The analyst would normally observe only one of them, but we keep both in the notebook so the foundations are visible.

N = 5_000

baseline_need = rng.normal(0, 1, size=N)
prior_engagement = rng.normal(0, 1, size=N)
account_tenure = rng.normal(0, 1, size=N)
friction_score = rng.normal(0, 1, size=N)
region_risk = rng.binomial(1, 0.35, size=N)
high_need_segment = (baseline_need > 0).astype(int)

# Baseline outcome under no treatment.
shared_noise = rng.normal(0, 0.70, size=N)
y0 = (
    0.65 * baseline_need
    + 0.35 * prior_engagement
    - 0.25 * account_tenure
    - 0.35 * friction_score
    - 0.15 * region_risk
    + shared_noise
)

# Heterogeneous treatment effect. Some units benefit more than others.
true_cate = (
    0.30
    + 0.45 * high_need_segment
    + 0.25 * prior_engagement
    - 0.35 * friction_score
    + 0.10 * baseline_need
    - 0.05 * region_risk
)
y1 = y0 + true_cate

# Observational treatment assignment: treatment is more likely for some higher-need and higher-friction units.
propensity = 1 / (1 + np.exp(-(
    -0.20
    + 0.75 * baseline_need
    + 0.50 * prior_engagement
    - 0.30 * account_tenure
    + 0.45 * friction_score
    + 0.25 * region_risk
)))
treatment = rng.binomial(1, propensity, size=N)

observed_outcome = np.where(treatment == 1, y1, y0)
missing_counterfactual = np.where(treatment == 1, y0, y1)

potential_df = pd.DataFrame(
    {
        "baseline_need": baseline_need,
        "prior_engagement": prior_engagement,
        "account_tenure": account_tenure,
        "friction_score": friction_score,
        "region_risk": region_risk,
        "high_need_segment": high_need_segment,
        "propensity": propensity,
        "treatment": treatment,
        "y0": y0,
        "y1": y1,
        "observed_outcome": observed_outcome,
        "missing_counterfactual": missing_counterfactual,
        "true_cate": true_cate,
    }
)

analyst_df = potential_df.drop(columns=["y0", "y1", "missing_counterfactual", "true_cate", "propensity"]).copy()

potential_df.to_csv(TABLE_DIR / "01_potential_outcomes_teaching_data_with_truth.csv", index=False)
analyst_df.to_csv(TABLE_DIR / "01_analyst_observed_data.csv", index=False)

display(potential_df.head())
baseline_need prior_engagement account_tenure friction_score region_risk high_need_segment propensity treatment y0 y1 observed_outcome missing_counterfactual true_cate
0 -0.7931 -1.2466 -0.1389 0.5943 1 0 0.2975 0 -2.4796 -2.8285 -2.4796 -2.8285 -0.3490
1 0.2406 0.1167 -1.5248 1.3611 0 1 0.7519 0 1.6533 1.9801 1.6533 1.9801 0.3268
2 -1.8963 0.0111 -1.5855 1.2671 0 0 0.3610 1 -0.7512 -1.0815 -1.0815 -0.7512 -0.3303
3 1.3958 -0.2102 -0.0577 0.6624 0 1 0.7421 1 0.2778 0.8830 0.8830 0.2778 0.6052
4 0.6383 -0.4979 1.3250 0.0489 0 1 0.4144 0 0.1120 0.7842 0.1120 0.7842 0.6722

The full teaching dataframe contains both potential outcomes and the true CATE. The analyst-facing dataframe removes those truth columns because real data would not contain them.

Data Dictionary

The data dictionary separates observed analyst columns from oracle-only teaching columns. This is a habit worth keeping in all synthetic tutorials.

data_dictionary = pd.DataFrame(
    [
        {"column": "baseline_need", "role": "observed pre-treatment feature", "visible to analyst": True},
        {"column": "prior_engagement", "role": "observed pre-treatment feature", "visible to analyst": True},
        {"column": "account_tenure", "role": "observed pre-treatment feature", "visible to analyst": True},
        {"column": "friction_score", "role": "observed pre-treatment feature", "visible to analyst": True},
        {"column": "region_risk", "role": "observed pre-treatment feature", "visible to analyst": True},
        {"column": "high_need_segment", "role": "observed effect-modifier segment", "visible to analyst": True},
        {"column": "treatment", "role": "observed binary treatment", "visible to analyst": True},
        {"column": "observed_outcome", "role": "observed factual outcome", "visible to analyst": True},
        {"column": "propensity", "role": "true treatment probability from simulator", "visible to analyst": False},
        {"column": "y0", "role": "potential outcome under control", "visible to analyst": False},
        {"column": "y1", "role": "potential outcome under treatment", "visible to analyst": False},
        {"column": "missing_counterfactual", "role": "the potential outcome not observed for the assigned treatment", "visible to analyst": False},
        {"column": "true_cate", "role": "Y(1) - Y(0) in the simulator", "visible to analyst": False},
    ]
)

data_dictionary.to_csv(TABLE_DIR / "01_data_dictionary.csv", index=False)
display(data_dictionary)
column role visible to analyst
0 baseline_need observed pre-treatment feature True
1 prior_engagement observed pre-treatment feature True
2 account_tenure observed pre-treatment feature True
3 friction_score observed pre-treatment feature True
4 region_risk observed pre-treatment feature True
5 high_need_segment observed effect-modifier segment True
6 treatment observed binary treatment True
7 observed_outcome observed factual outcome True
8 propensity true treatment probability from simulator False
9 y0 potential outcome under control False
10 y1 potential outcome under treatment False
11 missing_counterfactual the potential outcome not observed for the ass... False
12 true_cate Y(1) - Y(0) in the simulator False

The oracle-only columns are what make the lesson possible. In real data, these columns are precisely what causal inference tries to reason about indirectly.

Basic Dataset Summary

Before estimands, we check the basic shape of the data and the distribution of treatment. This also gives the true effect quantities available only in the synthetic setup.

true_ate = potential_df["true_cate"].mean()
true_att = potential_df.loc[potential_df["treatment"] == 1, "true_cate"].mean()
true_atc = potential_df.loc[potential_df["treatment"] == 0, "true_cate"].mean()

basic_summary = pd.DataFrame(
    [
        {"quantity": "rows", "value": len(potential_df)},
        {"quantity": "treatment_rate", "value": potential_df["treatment"].mean()},
        {"quantity": "observed_outcome_mean", "value": potential_df["observed_outcome"].mean()},
        {"quantity": "true_ate", "value": true_ate},
        {"quantity": "true_att", "value": true_att},
        {"quantity": "true_atc", "value": true_atc},
        {"quantity": "true_cate_std", "value": potential_df["true_cate"].std()},
        {"quantity": "share_negative_true_cate", "value": (potential_df["true_cate"] < 0).mean()},
    ]
)

basic_summary.to_csv(TABLE_DIR / "01_basic_summary.csv", index=False)
display(basic_summary)
quantity value
0 rows 5,000.0000
1 treatment_rate 0.4904
2 observed_outcome_mean 0.2168
3 true_ate 0.5050
4 true_att 0.5771
5 true_atc 0.4355
6 true_cate_std 0.5345
7 share_negative_true_cate 0.1762

The ATE is positive, but the CATE standard deviation and negative-effect share tell us there is meaningful heterogeneity. A single average will hide some of that structure.

The Fundamental Missing-Data Problem

For each unit, we observe only the potential outcome corresponding to the treatment actually received. The other potential outcome is counterfactual.

observability_example = potential_df.head(10).copy()
observability_example["observed_y0"] = np.where(observability_example["treatment"] == 0, observability_example["y0"], np.nan)
observability_example["observed_y1"] = np.where(observability_example["treatment"] == 1, observability_example["y1"], np.nan)
observability_example["unobserved_counterfactual_label"] = np.where(
    observability_example["treatment"] == 1,
    "Y(0) is missing",
    "Y(1) is missing",
)

observability_display = observability_example[
    ["treatment", "observed_outcome", "observed_y0", "observed_y1", "unobserved_counterfactual_label", "true_cate"]
]
observability_display.to_csv(TABLE_DIR / "01_fundamental_problem_example.csv", index=False)
display(observability_display)
treatment observed_outcome observed_y0 observed_y1 unobserved_counterfactual_label true_cate
0 0 -2.4796 -2.4796 NaN Y(1) is missing -0.3490
1 0 1.6533 1.6533 NaN Y(1) is missing 0.3268
2 1 -1.0815 NaN -1.0815 Y(0) is missing -0.3303
3 1 0.8830 NaN 0.8830 Y(0) is missing 0.6052
4 0 0.1120 0.1120 NaN Y(1) is missing 0.6722
5 1 1.2231 NaN 1.2231 Y(0) is missing 0.2124
6 0 -0.0916 -0.0916 NaN Y(1) is missing 0.1911
7 1 1.5813 NaN 1.5813 Y(0) is missing 0.6849
8 1 -0.1073 NaN -0.1073 Y(0) is missing 0.6197
9 0 -0.9525 -0.9525 NaN Y(1) is missing 0.0157

The true_cate column is shown only because this is a simulation. In real data, we would not observe both Y(0) and Y(1) for the same row.

True Effect Quantities From Oracle Data

With synthetic potential outcomes, we can compute ATE, ATT, ATC, and segment CATE directly. This gives us a target for later estimation notebooks.

true_effect_summary = pd.DataFrame(
    [
        {
            "estimand": "ATE",
            "value": true_ate,
            "population": "all units",
            "formula in this simulation": "mean(Y1 - Y0)",
        },
        {
            "estimand": "ATT",
            "value": true_att,
            "population": "treated units",
            "formula in this simulation": "mean(Y1 - Y0 | T=1)",
        },
        {
            "estimand": "ATC",
            "value": true_atc,
            "population": "control units",
            "formula in this simulation": "mean(Y1 - Y0 | T=0)",
        },
    ]
)

true_effect_summary.to_csv(TABLE_DIR / "01_true_effect_summary.csv", index=False)
display(true_effect_summary)
estimand value population formula in this simulation
0 ATE 0.5050 all units mean(Y1 - Y0)
1 ATT 0.5771 treated units mean(Y1 - Y0 | T=1)
2 ATC 0.4355 control units mean(Y1 - Y0 | T=0)

ATE, ATT, and ATC differ because treatment assignment is related to features that also modify treatment effects. This is common in targeted observational systems.
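A stripped-down simulation (hypothetical variables, not the notebook's dataset) shows the mechanism: when assignment targets the same feature that modifies the effect, the treated population has a larger average effect than the full population:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
need = rng.normal(size=n)
tau = 0.5 + 0.3 * need                     # true effect grows with need
p_treat = 1 / (1 + np.exp(-need))          # assignment also targets high need
t = rng.binomial(1, p_treat)

ate, att, atc = tau.mean(), tau[t == 1].mean(), tau[t == 0].mean()
print(f"ATE={ate:.2f}  ATT={att:.2f}  ATC={atc:.2f}")  # expect ATT > ATE > ATC
```

If assignment were independent of `need`, the three quantities would agree up to sampling noise.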

Raw Difference Versus True ATE

The raw treated-control difference compares observed outcomes by treatment group. It is easy to compute, but it is not automatically causal.

raw_group_summary = (
    potential_df.groupby("treatment")
    .agg(
        rows=("observed_outcome", "size"),
        mean_observed_outcome=("observed_outcome", "mean"),
        mean_true_cate=("true_cate", "mean"),
        mean_propensity=("propensity", "mean"),
    )
    .reset_index()
)

raw_difference = (
    raw_group_summary.loc[raw_group_summary["treatment"] == 1, "mean_observed_outcome"].iloc[0]
    - raw_group_summary.loc[raw_group_summary["treatment"] == 0, "mean_observed_outcome"].iloc[0]
)

raw_vs_truth = pd.DataFrame(
    [
        {"quantity": "raw treated-control difference", "value": raw_difference},
        {"quantity": "true ATE", "value": true_ate},
        {"quantity": "raw bias versus true ATE", "value": raw_difference - true_ate},
    ]
)

raw_group_summary.to_csv(TABLE_DIR / "01_raw_group_summary.csv", index=False)
raw_vs_truth.to_csv(TABLE_DIR / "01_raw_difference_vs_truth.csv", index=False)

display(raw_group_summary)
display(raw_vs_truth)
treatment rows mean_observed_outcome mean_true_cate mean_propensity
0 0 2548 -0.2963 0.4355 0.3846
1 1 2452 0.7501 0.5771 0.5752
quantity value
0 raw treated-control difference 1.0464
1 true ATE 0.5050
2 raw bias versus true ATE 0.5415

The raw difference is biased because treatment is observational. Treated units have different baseline features and different treatment-effect profiles.
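The bias has a clean in-sample decomposition: the raw treated-control difference equals the ATT plus a baseline selection term, E[Y(0) | T=1] - E[Y(0) | T=0]. This standalone sketch (toy simulation, not the notebook's dataset) verifies the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
need = rng.normal(size=n)
y0 = need + rng.normal(size=n)             # baseline outcome rises with need
y1 = y0 + 0.5                              # constant true effect
t = rng.binomial(1, 1 / (1 + np.exp(-need)))
y = np.where(t == 1, y1, y0)

raw_diff = y[t == 1].mean() - y[t == 0].mean()
att = (y1 - y0)[t == 1].mean()
selection_bias = y0[t == 1].mean() - y0[t == 0].mean()

# In-sample identity: raw difference = ATT + baseline selection bias.
print(f"raw={raw_diff:.2f}  att={att:.2f}  selection_bias={selection_bias:.2f}")
```

Here the selection term is positive because treated units start with higher baseline outcomes, so the raw difference overstates the effect, exactly as in the notebook's table.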

Confounding Check With Covariate Balance

If treatment were randomized, pre-treatment covariates would be similar across treatment groups up to sampling noise. Here they are not.

PRE_TREATMENT_FEATURES = [
    "baseline_need",
    "prior_engagement",
    "account_tenure",
    "friction_score",
    "region_risk",
    "high_need_segment",
]
X_EFFECT_MODIFIERS = ["baseline_need", "prior_engagement", "friction_score", "high_need_segment"]
W_CONTROLS = ["account_tenure", "region_risk"]


def standardized_mean_difference(data, column, treatment_col="treatment"):
    treated = data.loc[data[treatment_col] == 1, column]
    control = data.loc[data[treatment_col] == 0, column]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd if pooled_sd > 0 else 0.0

balance_table = pd.DataFrame(
    [
        {
            "feature": column,
            "control_mean": potential_df.loc[potential_df["treatment"] == 0, column].mean(),
            "treated_mean": potential_df.loc[potential_df["treatment"] == 1, column].mean(),
            "standardized_mean_difference": standardized_mean_difference(potential_df, column),
        }
        for column in PRE_TREATMENT_FEATURES
    ]
).sort_values("standardized_mean_difference", key=lambda values: values.abs(), ascending=False)

balance_table.to_csv(TABLE_DIR / "01_covariate_balance.csv", index=False)
display(balance_table)
feature control_mean treated_mean standardized_mean_difference
0 baseline_need -0.3046 0.3015 0.6330
5 high_need_segment 0.3870 0.6321 0.5058
1 prior_engagement -0.1981 0.2125 0.4163
3 friction_score -0.1632 0.2063 0.3706
2 account_tenure 0.1198 -0.1488 -0.2727
4 region_risk 0.3148 0.3679 0.1122

The standardized differences show substantial pre-treatment imbalance. This is why later estimators need controls and nuisance models.
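To build intuition for the SMD scale, this standalone sketch (toy data, not the notebook's dataset) contrasts a randomized assignment, where the SMD hovers near zero, with a targeted one, where it is large:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)

def smd(values, treated):
    # Standardized mean difference with a pooled standard deviation.
    a, b = values[treated == 1], values[treated == 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

t_random = rng.binomial(1, 0.5, size=n)                # randomized assignment
t_targeted = rng.binomial(1, 1 / (1 + np.exp(-x)))     # assignment tracks x
print(f"randomized SMD: {smd(x, t_random):+.3f}")
print(f"targeted SMD:   {smd(x, t_targeted):+.3f}")
```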

Plot Covariate Imbalance

The dashed lines at +/-0.1 are rough guides. Values outside the band suggest meaningful imbalance.

fig, ax = plt.subplots(figsize=(9, 5.2))
sns.barplot(
    data=balance_table,
    x="standardized_mean_difference",
    y="feature",
    hue="feature",
    dodge=False,
    palette="viridis",
    legend=False,
    ax=ax,
)
ax.axvline(0, color="#111827", linewidth=1)
ax.axvline(0.1, color="#64748b", linestyle="--", linewidth=1)
ax.axvline(-0.1, color="#64748b", linestyle="--", linewidth=1)
ax.set_title("Pre-Treatment Feature Imbalance")
ax.set_xlabel("Standardized Mean Difference")
ax.set_ylabel("Feature")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_covariate_imbalance.png", dpi=160, bbox_inches="tight")
plt.show()

The imbalance plot explains why an unadjusted comparison is not credible. It also previews the role of treatment nuisance models in DML and DR learners.

Overlap Check

Overlap asks whether treatment and control units exist in the same feature regions. We fit a simple propensity model to visualize estimated assignment probabilities.

propensity_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1_000, solver="lbfgs"),
)
propensity_model.fit(potential_df[PRE_TREATMENT_FEATURES], potential_df["treatment"])
potential_df["estimated_propensity"] = propensity_model.predict_proba(potential_df[PRE_TREATMENT_FEATURES])[:, 1]

propensity_auc = roc_auc_score(potential_df["treatment"], potential_df["estimated_propensity"])
propensity_summary = potential_df["estimated_propensity"].describe().to_frame("estimated_propensity").reset_index()
propensity_summary = propensity_summary.rename(columns={"index": "summary"})
propensity_summary.loc[len(propensity_summary)] = {"summary": "roc_auc", "estimated_propensity": propensity_auc}

propensity_summary.to_csv(TABLE_DIR / "01_propensity_summary.csv", index=False)
print(f"Estimated propensity ROC AUC: {propensity_auc:.3f}")
display(propensity_summary)
Estimated propensity ROC AUC: 0.752
summary estimated_propensity
0 count 5,000.0000
1 mean 0.4904
2 std 0.2195
3 min 0.0237
4 25% 0.3182
5 50% 0.4830
6 75% 0.6640
7 max 0.9838
8 roc_auc 0.7523

The treatment model is clearly predictive, confirming that assignment is not random. For credible causal comparison, the treated and control propensity distributions must also overlap.
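One common positivity diagnostic, sketched here on stand-in scores rather than the notebook's fitted propensities, is to flag rows with extreme estimated propensities, where one arm is nearly unobserved and effect estimates rely on extrapolation:

```python
import numpy as np

rng = np.random.default_rng(5)
e_hat = rng.beta(2, 2, size=10_000)        # stand-in for estimated propensity scores

# Flag rows outside a conventional trimming band; thresholds are a rough
# rule of thumb, not a universal standard.
lo, hi = 0.05, 0.95
flagged = (e_hat < lo) | (e_hat > hi)
print(f"share outside [{lo}, {hi}]: {flagged.mean():.3f}")
```

A large flagged share would warn that some CATE estimates depend heavily on modeling assumptions rather than on comparable units.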

Plot Propensity Overlap

This plot compares estimated propensities for treated and control units. Severe separation would make CATE estimation much more fragile.

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(
    data=potential_df,
    x="estimated_propensity",
    hue="treatment",
    bins=40,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    linewidth=2,
    ax=ax,
)
ax.set_title("Estimated Propensity Overlap")
ax.set_xlabel("Estimated Propensity Of Treatment")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The overlap is usable for a teaching example. Some tail regions are thinner, which is exactly where individual CATE estimates would be less trustworthy.

True CATE Distribution

CATE is a distribution, not just one number. We plot the true synthetic CATE to see how much heterogeneity exists.

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(potential_df, x="true_cate", bins=45, kde=True, color="#2563eb", ax=ax)
ax.axvline(true_ate, color="#111827", linestyle="--", linewidth=1.4, label="true ATE")
ax.axvline(0, color="#ef4444", linestyle=":", linewidth=1.2, label="zero effect")
ax.set_title("True CATE Distribution")
ax.set_xlabel("Y(1) - Y(0)")
ax.set_ylabel("Units")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_true_cate_distribution.png", dpi=160, bbox_inches="tight")
plt.show()

The distribution makes the core EconML motivation visible. Some units have much larger expected benefit than the average, while a smaller share may have little or negative benefit.

Segment-Level CATE

CATE becomes easier to communicate when summarized over meaningful segments. Here we use the high-need segment and friction buckets.

potential_df["friction_bucket"] = pd.qcut(
    potential_df["friction_score"],
    q=4,
    labels=["lowest friction", "low-mid friction", "high-mid friction", "highest friction"],
)
potential_df["need_segment_label"] = np.where(
    potential_df["high_need_segment"] == 1,
    "higher baseline need",
    "lower baseline need",
)

segment_cate = (
    potential_df.groupby(["need_segment_label", "friction_bucket"], observed=True)
    .agg(
        rows=("true_cate", "size"),
        true_cate_mean=("true_cate", "mean"),
        true_cate_median=("true_cate", "median"),
        treatment_rate=("treatment", "mean"),
    )
    .reset_index()
)

segment_cate.to_csv(TABLE_DIR / "01_segment_cate.csv", index=False)
display(segment_cate)
need_segment_label friction_bucket rows true_cate_mean true_cate_median treatment_rate
0 higher baseline need lowest friction 653 1.2581 1.2277 0.4778
1 higher baseline need low-mid friction 639 0.9237 0.9496 0.5884
2 higher baseline need high-mid friction 593 0.6815 0.6849 0.6526
3 higher baseline need highest friction 651 0.3517 0.3446 0.7296
4 lower baseline need lowest friction 597 0.6441 0.6322 0.2647
5 lower baseline need low-mid friction 611 0.3209 0.3194 0.3273
6 lower baseline need high-mid friction 657 0.0825 0.0819 0.3866
7 lower baseline need highest friction 599 -0.2585 -0.2428 0.4841

The segment table shows meaningful variation. Higher need tends to raise treatment benefit, while higher friction tends to reduce it in this simulator.

Plot Segment-Level CATE

A heatmap makes the two-way heterogeneity pattern easier to scan.

segment_heatmap = segment_cate.pivot(
    index="need_segment_label",
    columns="friction_bucket",
    values="true_cate_mean",
)

fig, ax = plt.subplots(figsize=(10, 4.8))
sns.heatmap(
    segment_heatmap,
    annot=True,
    fmt=".3f",
    cmap="YlGnBu",
    linewidths=0.5,
    cbar_kws={"label": "Mean true CATE"},
    ax=ax,
)
ax.set_title("True CATE By Need Segment And Friction Bucket")
ax.set_xlabel("Friction Bucket")
ax.set_ylabel("Need Segment")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_segment_cate_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

This plot shows why targeting can matter. Treating every unit the same would ignore large differences in expected benefit.
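The value of targeting can be made concrete with a standalone sketch (hypothetical per-unit effects, not the notebook's dataset): treating only the units with positive effect always does at least as well per unit as treating everyone, and strictly better whenever some effects are negative:

```python
import numpy as np

rng = np.random.default_rng(6)
tau = rng.normal(0.3, 0.5, size=100_000)   # hypothetical per-unit treatment effects

gain_treat_all = tau.mean()                          # average gain if everyone is treated
gain_targeted = np.where(tau > 0, tau, 0.0).mean()   # treat only positive-effect units
print(f"treat all: {gain_treat_all:.3f}   treat positive-CATE only: {gain_targeted:.3f}")
```

In practice the targeting rule would use estimated, not true, CATE, which is why estimation quality in the tails matters.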

CATE Drivers In The Oracle Data

Because true CATE is known, we can regress it on features to show which variables drive heterogeneity. In real data, this would be replaced by model-based explanation of estimated CATE.

oracle_cate_model = sm.OLS(
    potential_df["true_cate"],
    sm.add_constant(potential_df[X_EFFECT_MODIFIERS]),
).fit()

oracle_driver_table = (
    oracle_cate_model.params.rename("coefficient")
    .to_frame()
    .join(oracle_cate_model.bse.rename("standard_error"))
    .reset_index(names="feature")
)

oracle_driver_table.to_csv(TABLE_DIR / "01_oracle_cate_drivers.csv", index=False)
display(oracle_driver_table)
feature coefficient standard_error
0 const 0.2825 0.0007
1 baseline_need 0.0995 0.0006
2 prior_engagement 0.2500 0.0003
3 friction_score -0.3498 0.0003
4 high_need_segment 0.4510 0.0011

The oracle regression recovers the simulator logic: high need and prior engagement raise the effect, while friction lowers it. The intercept absorbs the omitted region_risk term, roughly 0.30 - 0.05 x mean(region_risk) = 0.2825. Later notebooks will try to learn this from observed outcomes only.

Naive Segment Effects Versus True Segment Effects

A common first attempt is to compute treated-control differences inside segments. This can still be biased if treatment is confounded within those segments.

naive_segment_rows = []
for segment_name, segment_df in potential_df.groupby("need_segment_label"):
    treated_mean = segment_df.loc[segment_df["treatment"] == 1, "observed_outcome"].mean()
    control_mean = segment_df.loc[segment_df["treatment"] == 0, "observed_outcome"].mean()
    naive_difference = treated_mean - control_mean
    true_segment_effect = segment_df["true_cate"].mean()
    naive_segment_rows.append(
        {
            "segment": segment_name,
            "rows": len(segment_df),
            "naive_treated_control_difference": naive_difference,
            "true_segment_cate": true_segment_effect,
            "bias": naive_difference - true_segment_effect,
        }
    )

naive_segment_effects = pd.DataFrame(naive_segment_rows)
naive_segment_effects.to_csv(TABLE_DIR / "01_naive_segment_effects.csv", index=False)
display(naive_segment_effects)
segment rows naive_treated_control_difference true_segment_cate bias
0 higher baseline need 2536 1.0013 0.8063 0.1949
1 lower baseline need 2464 0.4524 0.1948 0.2576

Even segment-level comparisons can be biased. Segmenting does not automatically solve confounding; it only changes the population being compared.
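The point above can be made concrete with a tiny standalone simulation (the names here are invented for illustration, not from the notebook's dataset): inside a single segment where the true effect is constant, a hidden confounder that drives both treatment and outcome still biases the treated-control difference.

```python
import numpy as np

# Hedged sketch: within one "segment" the true effect is a constant 0.5,
# but an unobserved confounder drives both treatment uptake and the outcome.
rng = np.random.default_rng(0)
n = 50_000

true_effect = 0.5
confounder = rng.normal(size=n)                    # unobserved within the segment
propensity = 1 / (1 + np.exp(-2.0 * confounder))   # confounder raises treatment odds
treatment = rng.binomial(1, propensity)
outcome = confounder + true_effect * treatment + rng.normal(scale=0.1, size=n)

# Naive within-segment treated-control difference absorbs the confounder gap.
naive_diff = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"true effect: {true_effect:.2f}, naive within-segment difference: {naive_diff:.2f}")
```

Because treated units have systematically higher confounder values, the naive difference lands well above the true effect even though every unit in the segment shares the same effect.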

A Simple Interaction Regression Bridge

Before using EconML, we can fit a transparent baseline: an outcome regression with treatment-feature interactions. This is not the final estimator, but it shows the shape of a CATE model.

interaction_df = potential_df.copy()
for feature in X_EFFECT_MODIFIERS:
    interaction_df[f"treatment_x_{feature}"] = interaction_df["treatment"] * interaction_df[feature]

interaction_features = (
    ["treatment"]
    + PRE_TREATMENT_FEATURES
    + [f"treatment_x_{feature}" for feature in X_EFFECT_MODIFIERS]
)

interaction_model = sm.OLS(
    interaction_df["observed_outcome"],
    sm.add_constant(interaction_df[interaction_features]),
).fit()

interaction_coefficients = (
    interaction_model.params.rename("coefficient")
    .to_frame()
    .join(interaction_model.bse.rename("standard_error"))
    .reset_index(names="term")
)
interaction_coefficients.to_csv(TABLE_DIR / "01_interaction_regression_coefficients.csv", index=False)
display(interaction_coefficients.head(15))
term coefficient standard_error
0 const -0.0022 0.0288
1 treatment 0.2854 0.0404
2 baseline_need 0.6565 0.0235
3 prior_engagement 0.3428 0.0142
4 account_tenure -0.2604 0.0101
5 friction_score -0.3528 0.0142
6 region_risk -0.1553 0.0210
7 high_need_segment -0.0097 0.0462
8 treatment_x_baseline_need 0.0840 0.0337
9 treatment_x_prior_engagement 0.2638 0.0202
10 treatment_x_friction_score -0.3584 0.0200
11 treatment_x_high_need_segment 0.4525 0.0664

The interaction terms are a simple way to let treatment effects vary with features. EconML estimators generalize this idea with more careful nuisance modeling and flexible final stages.

Recover CATE From The Interaction Regression

For the interaction model, the estimated CATE for a row with effect-modifier values x is the treatment coefficient plus the interaction coefficients weighted by that row's feature values:

tau_hat(x) = beta_treatment + sum_j beta_treatment_x_feature_j * x_j

base_treatment_coef = interaction_model.params["treatment"]
estimated_cate_interaction = np.full(len(interaction_df), base_treatment_coef)

for feature in X_EFFECT_MODIFIERS:
    estimated_cate_interaction += interaction_model.params[f"treatment_x_{feature}"] * interaction_df[feature]

interaction_df["estimated_cate_interaction"] = estimated_cate_interaction

interaction_metrics = pd.DataFrame(
    [
        {"metric": "true ATE", "value": interaction_df["true_cate"].mean()},
        {"metric": "interaction-regression estimated ATE", "value": interaction_df["estimated_cate_interaction"].mean()},
        {"metric": "CATE correlation with truth", "value": np.corrcoef(interaction_df["true_cate"], interaction_df["estimated_cate_interaction"])[0, 1]},
        {"metric": "CATE RMSE", "value": np.sqrt(mean_squared_error(interaction_df["true_cate"], interaction_df["estimated_cate_interaction"]))},
    ]
)

interaction_metrics.to_csv(TABLE_DIR / "01_interaction_cate_metrics.csv", index=False)
display(interaction_metrics)
metric value
0 true ATE 0.5050
1 interaction-regression estimated ATE 0.5087
2 CATE correlation with truth 0.9982
3 CATE RMSE 0.0328

The simple interaction model performs well here because the simulator is mostly linear. Later notebooks will show why we need stronger tools when nuisance functions or CATE patterns are more complex.
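A minimal sketch (with made-up variable names, not the notebook's simulator) shows what "less friendly" means: when the true CATE is nonlinear in x, a linear treatment-by-feature interaction can only recover its best linear approximation, so the ATE stays roughly right while row-level CATE errors grow.

```python
import numpy as np

# Hedged sketch: randomized treatment, but a nonlinear effect surface.
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-2, 2, size=n)
true_cate = np.sin(2.0 * x)                  # nonlinear treatment effect
treatment = rng.binomial(1, 0.5, size=n)     # randomized, so no confounding
y = 0.3 * x + true_cate * treatment + rng.normal(scale=0.1, size=n)

# Linear interaction OLS via least squares: y ~ 1 + x + T + T*x
design = np.column_stack([np.ones(n), x, treatment, treatment * x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
estimated_cate = coef[2] + coef[3] * x       # CATE forced to be linear in x

rmse = np.sqrt(np.mean((estimated_cate - true_cate) ** 2))
print(f"true ATE {true_cate.mean():.3f} vs estimated {estimated_cate.mean():.3f}, CATE RMSE {rmse:.3f}")
```

The average effect is estimated well, but the per-row CATE RMSE is large: exactly the failure mode that motivates the more flexible final stages in later notebooks.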

Plot Estimated Versus True CATE For The Simple Baseline

This plot checks whether the interaction regression learns the treatment-effect ranking, not just the average.

fig, ax = plt.subplots(figsize=(7, 6))
sns.scatterplot(
    data=interaction_df.sample(1_200, random_state=RANDOM_SEED),
    x="true_cate",
    y="estimated_cate_interaction",
    alpha=0.45,
    s=28,
    edgecolor=None,
    ax=ax,
)
min_value = min(interaction_df["true_cate"].min(), interaction_df["estimated_cate_interaction"].min())
max_value = max(interaction_df["true_cate"].max(), interaction_df["estimated_cate_interaction"].max())
ax.plot([min_value, max_value], [min_value, max_value], color="#111827", linestyle="--", linewidth=1.2)
ax.set_title("Simple Interaction Regression: Estimated Versus True CATE")
ax.set_xlabel("True CATE")
ax.set_ylabel("Estimated CATE")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_interaction_estimated_vs_true_cate.png", dpi=160, bbox_inches="tight")
plt.show()

The baseline recovers the broad pattern in this friendly simulation. EconML becomes more valuable as the data-generating process becomes less friendly.

CATE As A Treatment-Targeting Signal

CATE estimates are often used to prioritize treatment. With oracle potential outcomes, we can compare simple targeting policies using the true treatment effects.

policy_df = potential_df.copy()
policy_df["oracle_rank"] = policy_df["true_cate"].rank(ascending=False, method="first")
policy_df["treat_top_30_percent"] = (policy_df["oracle_rank"] <= 0.30 * len(policy_df)).astype(int)
policy_df["treat_positive_effect_only"] = (policy_df["true_cate"] > 0).astype(int)
policy_df["treat_random_30_percent"] = rng.binomial(1, 0.30, size=len(policy_df))


def oracle_policy_value(treatment_rule):
    return np.mean(np.where(treatment_rule == 1, policy_df["y1"], policy_df["y0"]))

policy_summary = pd.DataFrame(
    [
        {"policy": "treat none", "treated_share": 0.0, "oracle_value": oracle_policy_value(np.zeros(len(policy_df)))},
        {"policy": "treat everyone", "treated_share": 1.0, "oracle_value": oracle_policy_value(np.ones(len(policy_df)))},
        {"policy": "random 30 percent", "treated_share": policy_df["treat_random_30_percent"].mean(), "oracle_value": oracle_policy_value(policy_df["treat_random_30_percent"])},
        {"policy": "oracle top 30 percent by CATE", "treated_share": policy_df["treat_top_30_percent"].mean(), "oracle_value": oracle_policy_value(policy_df["treat_top_30_percent"])},
        {"policy": "oracle positive CATE only", "treated_share": policy_df["treat_positive_effect_only"].mean(), "oracle_value": oracle_policy_value(policy_df["treat_positive_effect_only"])},
    ]
)
policy_summary["value_lift_vs_treat_none"] = policy_summary["oracle_value"] - policy_summary.loc[policy_summary["policy"] == "treat none", "oracle_value"].iloc[0]

policy_summary.to_csv(TABLE_DIR / "01_oracle_policy_summary.csv", index=False)
display(policy_summary)
policy treated_share oracle_value value_lift_vs_treat_none
0 treat none 0.0000 -0.0662 0.0000
1 treat everyone 1.0000 0.4388 0.5050
2 random 30 percent 0.2972 0.0854 0.1516
3 oracle top 30 percent by CATE 0.3000 0.2721 0.3383
4 oracle positive CATE only 0.8238 0.4881 0.5543

The oracle CATE policies outperform random targeting because they concentrate treatment where benefit is highest. Real CATE models try to approximate this ranking without observing the oracle truth.
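A small standalone sketch (hypothetical numbers, separate from the notebook's dataset) illustrates why approximation is good enough: a noisy CATE estimate only needs a useful ranking to beat random targeting, and it can never beat the oracle ranking at the same treated share.

```python
import numpy as np

# Hedged sketch: compare policy value under oracle, noisy, and random rankings.
rng = np.random.default_rng(2)
n = 10_000
true_cate = rng.normal(0.2, 0.5, size=n)                 # some units helped, some hurt
y0 = rng.normal(size=n)
y1 = y0 + true_cate
noisy_cate = true_cate + rng.normal(scale=0.3, size=n)   # imperfect estimate

def policy_value(treat):
    # Mean outcome if exactly the flagged units are treated.
    return np.where(treat, y1, y0).mean()

k = int(0.3 * n)                                         # treat the top 30 percent
oracle = np.zeros(n, dtype=bool)
oracle[np.argsort(-true_cate)[:k]] = True
noisy = np.zeros(n, dtype=bool)
noisy[np.argsort(-noisy_cate)[:k]] = True
random_rule = rng.random(n) < 0.3

print(f"random {policy_value(random_rule):.3f} < noisy {policy_value(noisy):.3f}"
      f" <= oracle {policy_value(oracle):.3f}")
```

The noisy ranking captures most of the oracle's gain over random targeting, which is why learned CATE models can add value without being exactly right.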

Plot Oracle Policy Values

The policy plot shows why heterogeneity matters operationally. A good treatment rule can create more value with fewer treated units.

plot_policy_summary = policy_summary.sort_values("oracle_value")

fig, ax = plt.subplots(figsize=(10, 5.5))
sns.barplot(
    data=plot_policy_summary,
    x="oracle_value",
    y="policy",
    hue="policy",
    dodge=False,
    palette="viridis",
    legend=False,
    ax=ax,
)
ax.set_title("Oracle Policy Value Under Different Treatment Rules")
ax.set_xlabel("Mean Potential Outcome Under Policy")
ax.set_ylabel("")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "01_oracle_policy_values.png", dpi=160, bbox_inches="tight")
plt.show()

The oracle policy is not achievable in real data, but it gives the north star for treatment-targeting notebooks: learn a ranking that improves policy value without relying on hidden truth.

Why EconML Needs X And W

This table connects the foundation concepts to the data roles used by EconML estimators.

x_w_role_summary = pd.DataFrame(
    [
        {
            "role": "X: effect modifiers",
            "columns in this notebook": ", ".join(X_EFFECT_MODIFIERS),
            "why it matters": "CATE is modeled as a function of these features.",
        },
        {
            "role": "W: controls",
            "columns in this notebook": ", ".join(W_CONTROLS),
            "why it matters": "Controls help nuisance models adjust for confounding.",
        },
        {
            "role": "T: treatment",
            "columns in this notebook": "treatment",
            "why it matters": "The intervention whose effect is estimated.",
        },
        {
            "role": "Y: outcome",
            "columns in this notebook": "observed_outcome",
            "why it matters": "Only the factual outcome is observed in real data.",
        },
    ]
)

x_w_role_summary.to_csv(TABLE_DIR / "01_x_w_role_summary.csv", index=False)
display(x_w_role_summary)
role columns in this notebook why it matters
0 X: effect modifiers baseline_need, prior_engagement, friction_score, high_need_segment CATE is modeled as a function of these features.
1 W: controls account_tenure, region_risk Controls help nuisance models adjust for confounding.
2 T: treatment treatment The intervention whose effect is estimated.
3 Y: outcome observed_outcome Only the factual outcome is observed in real data.

The same variable can sometimes be both a confounder and an effect modifier. The X and W split is a modeling choice that should follow the causal question.
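To see why the W role matters, here is a self-contained sketch (invented variable names, plain least squares rather than an EconML estimator): adjusting for a confounder W recovers the true effect, while dropping it does not.

```python
import numpy as np

# Hedged sketch: w plays the W role (confounder to adjust for).
rng = np.random.default_rng(3)
n = 50_000
w = rng.normal(size=n)                                   # confounder
t = rng.binomial(1, 1 / (1 + np.exp(-w)))                # w drives treatment
y = 2.0 * w + 0.5 * t + rng.normal(scale=0.1, size=n)    # true effect is 0.5

# Unadjusted treated-control difference absorbs the confounding through w.
unadjusted = y[t == 1].mean() - y[t == 0].mean()

# OLS of y on [1, t, w]: including w as a control removes the bias.
design = np.column_stack([np.ones(n), t, w])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"unadjusted {unadjusted:.2f} vs w-adjusted {coef[1]:.2f} (truth 0.50)")
```

EconML's nuisance models play the role of this adjustment in settings where a single linear regression would not be flexible enough.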

Foundation Checklist

Before fitting an EconML estimator, this checklist should be clear. It keeps CATE modeling connected to causal design rather than pure prediction.

foundation_checklist = pd.DataFrame(
    [
        {"check": "Treatment and outcome are defined", "status in this notebook": "treatment and observed_outcome"},
        {"check": "Potential-outcome estimand is named", "status in this notebook": "ATE, ATT, ATC, and CATE"},
        {"check": "Observed features are pre-treatment", "status in this notebook": "baseline features only"},
        {"check": "Confounding is diagnosed", "status in this notebook": "covariate balance table and plot"},
        {"check": "Overlap is diagnosed", "status in this notebook": "propensity summary and histogram"},
        {"check": "Effect modifiers are named", "status in this notebook": ", ".join(X_EFFECT_MODIFIERS)},
        {"check": "Controls are named", "status in this notebook": ", ".join(W_CONTROLS)},
        {"check": "A simple baseline is available", "status in this notebook": "interaction regression CATE baseline"},
        {"check": "Targeting use case is explicit", "status in this notebook": "oracle policy-value comparison"},
    ]
)

foundation_checklist.to_csv(TABLE_DIR / "01_foundation_checklist.csv", index=False)
display(foundation_checklist)
check status in this notebook
0 Treatment and outcome are defined treatment and observed_outcome
1 Potential-outcome estimand is named ATE, ATT, ATC, and CATE
2 Observed features are pre-treatment baseline features only
3 Confounding is diagnosed covariate balance table and plot
4 Overlap is diagnosed propensity summary and histogram
5 Effect modifiers are named baseline_need, prior_engagement, friction_score, high_need_segment
6 Controls are named account_tenure, region_risk
7 A simple baseline is available interaction regression CATE baseline
8 Targeting use case is explicit oracle policy-value comparison

The checklist is intentionally estimator-agnostic. It should be completed before choosing LinearDML, CausalForestDML, DRLearner, or any other method.

Final Summary

This notebook introduced the potential-outcomes foundation for the EconML series.

Key takeaways:

  • Real data reveal only one potential outcome per unit.
  • ATE, ATT, ATC, and CATE answer different population questions.
  • Raw treated-control differences can be badly biased in observational data.
  • CATE is useful because treatment effects vary across feature-defined groups.
  • Segment summaries and policy values show why heterogeneity matters for decisions.
  • Later EconML estimators try to recover CATE from observed data using nuisance models, effect modifiers, and assumptions about confounding and overlap.

The next notebook moves from these foundations to double machine learning: residualization, orthogonalization, nuisance models, and cross-fitting.