DoubleML Tutorial 14: Heterogeneous Treatment Effects, GATE, CATE, And BLP
This notebook moves from the average treatment effect to effect heterogeneity: the idea that the same treatment can help different units by different amounts. In product analytics, policy analysis, experimentation, marketplace design, and recommender systems, this is usually where causal inference becomes operational. A single average effect can tell us whether a feature, treatment, or intervention is useful on average. It does not tell us where the value is concentrated, which users may be harmed, or which subgroup deserves a more careful rollout.
The DoubleML tools in this notebook are built around the same orthogonal-score logic used earlier in the tutorial series. We first estimate a valid global causal effect under standard identification assumptions. Then we use DoubleML’s best linear predictor machinery to summarize heterogeneity in two related ways:
GATE, or group average treatment effects, asks: what is the average effect inside each pre-defined group?
CATE, or conditional average treatment effects, asks: how does the effect vary with observed covariates?
BLP, or best linear predictor, is the practical bridge: it projects the estimated individual-level orthogonal signal onto a chosen set of features or basis functions. This does not magically recover every nonlinear individual effect. It gives a disciplined, uncertainty-aware summary of the effect pattern represented by the chosen basis.
The most important theme is humility. Heterogeneity analysis is easy to over-narrate. A subgroup with a high estimated effect is not automatically a deployable targeting rule. It may be noisy, it may depend on a modeling choice, and it is still only as credible as the causal identification design behind the original DoubleML model.
Setup
This setup cell prepares the shared tutorial output folders, imports the libraries used throughout the notebook, and prints the DoubleML version. The code is intentionally visible because environment setup is part of a reproducible causal workflow.
A small note on warnings: the notebook suppresses common future/deprecation chatter and the familiar scikit-learn feature-name warning that can appear when estimators move between pandas and NumPy internally. It does not hide model failures or causal-design problems.
from pathlib import Path
import os
import warnings

# Find the repository root from wherever the notebook is executed.
PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "pyproject.toml").exists() and PROJECT_ROOT != PROJECT_ROOT.parent:
    PROJECT_ROOT = PROJECT_ROOT.parent

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"
for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

# Set Matplotlib's cache before importing pyplot so notebook execution stays quiet.
os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import brier_score_loss, log_loss, mean_squared_error

import doubleml as dml
from doubleml import DoubleMLData, DoubleMLIRM

NOTEBOOK_PREFIX = "14"
RANDOM_SEED = 140

sns.set_theme(style="whitegrid", context="talk")
pd.set_option("display.max_columns", 80)
pd.set_option("display.float_format", "{:.4f}".format)

print(f"Project root: {PROJECT_ROOT}")
print(f"DoubleML version: {dml.__version__}")
print(f"Outputs will be written to: {OUTPUT_DIR}")
Project root: /home/apex/Documents/ranking_sys
DoubleML version: 0.11.2
Outputs will be written to: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs
The setup confirms that the notebook is using the local environment and the shared DoubleML tutorial output folder. Every saved artifact in this notebook uses prefix 14, so the files remain easy to connect back to this notebook.
Helper Functions
The helper functions below keep repeated code readable. The most important helper is build_cate_basis(): it defines the basis used for the CATE/BLP projection. In real work, this basis should be chosen before looking at a large number of subgroup results. Otherwise heterogeneity analysis can quietly turn into an exploratory search for whatever story looks strongest.
def save_table(df, filename):
    """Save a table in the tutorial output folder and return the unchanged DataFrame."""
    path = TABLE_DIR / filename
    df.to_csv(path, index=False)
    return df

def save_dataset(df, filename):
    """Save a dataset in the tutorial output folder and return the unchanged DataFrame."""
    path = DATASET_DIR / filename
    df.to_csv(path, index=False)
    return df

def sigmoid(values):
    """Numerically stable logistic transform for synthetic propensity scores."""
    values = np.asarray(values)
    return 1.0 / (1.0 + np.exp(-values))

def rmse(y_true, y_pred):
    """Root mean squared error, used for outcome-nuisance diagnostics."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def build_cate_basis(frame):
    """Construct the linear basis used by DoubleML's CATE/BLP method."""
    return pd.DataFrame(
        {
            "intercept": 1.0,
            "engagement_z": frame["engagement_z"].to_numpy(),
            "price_sensitivity_z": frame["price_sensitivity_z"].to_numpy(),
            "high_engagement_flag": (frame["engagement_z"] > 0.5).astype(float).to_numpy(),
            "price_sensitive_flag": (frame["price_sensitivity_z"] > 0.5).astype(float).to_numpy(),
            "category_affinity_z": frame["category_affinity_z"].to_numpy(),
            "engagement_x_prior": (frame["engagement_z"] * frame["prior_value_z"]).to_numpy(),
        }
    )

def make_irm_data(frame, x_cols):
    """Create the DoubleMLData backend for a binary-treatment IRM example."""
    return DoubleMLData(frame, y_col="outcome", d_cols="treatment", x_cols=x_cols)

def ci_error_columns(df, estimate_col="effect", lower_col="ci_lower", upper_col="ci_upper"):
    """Return asymmetric error-bar arrays for Matplotlib."""
    return np.vstack(
        [
            df[estimate_col].to_numpy() - df[lower_col].to_numpy(),
            df[upper_col].to_numpy() - df[estimate_col].to_numpy(),
        ]
    )
These helpers encode two practical habits. First, every output is saved as a named artifact instead of only living in the notebook. Second, CATE summaries are tied to an explicit basis, which makes the heterogeneity claim auditable.
Heterogeneity Vocabulary
Before fitting models, it helps to separate the estimands. These terms are often used loosely, but they answer different questions.
ATE averages the treatment effect over the whole target population.
GATE averages the treatment effect inside a group such as a tenure band, engagement segment, or risk tier.
CATE conditions on covariates and is often written as tau(x) = E[Y(1) - Y(0) | X = x].
BLP of CATE is a linear projection of the CATE signal onto chosen basis features. It is a summary of heterogeneity, not a guarantee that the true effect is linear.
Effect modifier is a covariate that changes the treatment effect itself, not merely a covariate that predicts the outcome.
This table is saved because these definitions are useful language for a methods appendix or README.
hte_vocabulary = pd.DataFrame(
    [
        {
            "term": "ATE",
            "meaning": "Average treatment effect across the target population.",
            "typical_question": "Does the treatment help on average?",
        },
        {
            "term": "GATE",
            "meaning": "Group average treatment effect for mutually exclusive or deliberately defined groups.",
            "typical_question": "Which pre-defined segment appears to benefit more?",
        },
        {
            "term": "CATE",
            "meaning": "Conditional average treatment effect as a function of observed covariates.",
            "typical_question": "How does the effect vary with user or item features?",
        },
        {
            "term": "BLP",
            "meaning": "Best linear predictor of a heterogeneous effect signal using a chosen basis.",
            "typical_question": "Which simple basis summarizes effect heterogeneity well?",
        },
        {
            "term": "Effect modifier",
            "meaning": "A variable that changes the treatment effect, not just the baseline outcome.",
            "typical_question": "Does the causal effect itself change when this feature changes?",
        },
    ]
)
save_table(hte_vocabulary, f"{NOTEBOOK_PREFIX}_hte_vocabulary.csv")
display(hte_vocabulary)
              term  meaning                                                                                 typical_question
0              ATE  Average treatment effect across the target population.                                 Does the treatment help on average?
1             GATE  Group average treatment effect for mutually exclusive or deliberately defined groups.  Which pre-defined segment appears to benefit more?
2             CATE  Conditional average treatment effect as a function of observed covariates.             How does the effect vary with user or item features?
3              BLP  Best linear predictor of a heterogeneous effect signal using a chosen basis.           Which simple basis summarizes effect heterogeneity well?
4  Effect modifier  A variable that changes the treatment effect, not just the baseline outcome.           Does the causal effect itself change when this feature changes?
The key distinction is between outcome prediction and effect modification. A feature can be highly predictive of Y while having no relationship to Y(1) - Y(0). Heterogeneity analysis is about the second object.
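A tiny simulation (separate from this notebook's dataset, with illustrative names) makes the distinction concrete: `tenure` strongly predicts the outcome but does not modify the effect, while `engagement` modifies the effect without being a strong outcome driver. Treatment is randomized here so simple difference-in-means is valid.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
tenure = rng.normal(size=n)      # predicts Y, does not modify the effect
engagement = rng.normal(size=n)  # modifies the effect
d = rng.binomial(1, 0.5, n)      # randomized treatment for simplicity

tau = 0.5 + 0.4 * engagement     # true individual effect depends on engagement only
y = 2.0 * tenure + tau * d + rng.normal(size=n)

def naive_effect(mask):
    """Difference in mean outcomes between treated and untreated inside a subgroup."""
    return y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()

print("effect | high tenure:    ", round(naive_effect(tenure > 0), 2))
print("effect | low tenure:     ", round(naive_effect(tenure <= 0), 2))
print("effect | high engagement:", round(naive_effect(engagement > 0), 2))
print("effect | low engagement: ", round(naive_effect(engagement <= 0), 2))
```

Splitting by tenure leaves the effect essentially unchanged even though tenure dominates the outcome; splitting by engagement moves the effect substantially. That is the difference between an outcome predictor and an effect modifier.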
Synthetic Heterogeneous-Effect Design
We now create a synthetic binary-treatment dataset where the true treatment effect is known. The outcome is continuous, the treatment is assigned with confounding through observed covariates, and the treatment effect varies by engagement, price sensitivity, category affinity, and an engagement-by-prior-value interaction.
This is not meant to imitate a specific company or product. It is a teaching design that lets us compare DoubleML’s estimated heterogeneity summaries against an oracle true_tau column. In real observational data, true_tau is never observed; it appears here only so the tutorial can show what the estimators are trying to recover.
rng = np.random.default_rng(RANDOM_SEED)
n_obs = 1_600
x_cols = [
    "engagement_z",
    "tenure_z",
    "price_sensitivity_z",
    "category_affinity_z",
    "prior_value_z",
    "support_load_z",
]
hte_df = pd.DataFrame(rng.normal(size=(n_obs, len(x_cols))), columns=x_cols)

# The treatment assignment is confounded: the same observed features affect both treatment and outcome.
propensity_logit = (
    -0.15
    + 0.55 * hte_df["engagement_z"]
    - 0.38 * hte_df["price_sensitivity_z"]
    + 0.25 * hte_df["prior_value_z"]
    - 0.22 * hte_df["support_load_z"]
    + 0.18 * hte_df["engagement_z"] * hte_df["category_affinity_z"]
)
hte_df["propensity_true"] = 0.08 + 0.84 * sigmoid(propensity_logit)
hte_df["treatment"] = rng.binomial(1, hte_df["propensity_true"])

# The true effect is deliberately heterogeneous and partly nonlinear.
hte_df["high_engagement_flag"] = (hte_df["engagement_z"] > 0.5).astype(int)
hte_df["price_sensitive_flag"] = (hte_df["price_sensitivity_z"] > 0.5).astype(int)
hte_df["true_tau"] = (
    0.45
    + 0.32 * hte_df["high_engagement_flag"]
    - 0.22 * hte_df["price_sensitive_flag"]
    + 0.16 * hte_df["category_affinity_z"]
    + 0.10 * hte_df["engagement_z"] * hte_df["prior_value_z"]
)

# This segment is mutually exclusive so it is valid for the GATE call below.
hte_df["segment"] = np.select(
    [hte_df["engagement_z"] > 0.7, hte_df["price_sensitivity_z"] > 0.7],
    ["high_engagement", "price_sensitive"],
    default="general",
)

baseline_outcome = (
    2.0
    + 0.60 * hte_df["engagement_z"]
    + 0.25 * hte_df["tenure_z"]
    - 0.45 * hte_df["support_load_z"]
    + 0.30 * hte_df["prior_value_z"]
    + 0.20 * np.sin(hte_df["category_affinity_z"])
)
hte_df["outcome"] = baseline_outcome + hte_df["true_tau"] * hte_df["treatment"] + rng.normal(0, 0.9, n_obs)

save_dataset(hte_df, f"{NOTEBOOK_PREFIX}_synthetic_hte_data.csv")
display(hte_df.head())
   engagement_z  tenure_z  price_sensitivity_z  category_affinity_z  prior_value_z  support_load_z  propensity_true  treatment  high_engagement_flag  price_sensitive_flag  true_tau          segment  outcome
0       -1.1761    1.7999              -1.1815               2.4726        -1.6690          0.4317           0.2484          1                     0                     0    1.0419          general   1.8086
1       -0.8547   -1.3900              -0.8436              -2.7860         1.1934          0.4980           0.5662          1                     0                     0   -0.0977          general   1.0207
2       -0.3210   -0.5196               1.6460              -1.6166         0.7760         -0.2077           0.3740          0                     0                     1   -0.0536  price_sensitive   3.0849
3       -0.1585   -0.0587              -0.4249              -0.6167        -0.1388         -1.3622           0.5433          0                     0                     0    0.3535          general   1.2246
4        0.7556    1.2016               1.8237               0.6995         0.2417          0.5579           0.4182          0                     1                     1    0.6802  high_engagement   0.8292
The first rows show the causal ingredients: observed controls, true treatment propensity, assigned treatment, true treatment effect, segment label, and observed outcome. The model will not use true_tau or propensity_true; those columns are retained only for teaching diagnostics.
Field Dictionary
This field dictionary documents every column in the synthetic dataset. A clear data dictionary is especially useful for heterogeneity work because subgroup definitions can otherwise become ambiguous.
field_dictionary = pd.DataFrame(
    [
        {"column": "engagement_z", "role": "Observed control / effect modifier", "description": "Standardized prior engagement signal."},
        {"column": "tenure_z", "role": "Observed control", "description": "Standardized account age or relationship length signal."},
        {"column": "price_sensitivity_z", "role": "Observed control / effect modifier", "description": "Standardized signal for sensitivity to cost, friction, or effort."},
        {"column": "category_affinity_z", "role": "Observed control / effect modifier", "description": "Standardized affinity for the treatment-relevant category."},
        {"column": "prior_value_z", "role": "Observed control / interaction feature", "description": "Standardized prior value or historical outcome signal."},
        {"column": "support_load_z", "role": "Observed control", "description": "Standardized friction, complaint, or support burden signal."},
        {"column": "propensity_true", "role": "Oracle diagnostic", "description": "True treatment probability used by the simulator; not used by DoubleML."},
        {"column": "treatment", "role": "Treatment", "description": "Binary treatment/exposure indicator."},
        {"column": "high_engagement_flag", "role": "Effect-modifier basis feature", "description": "Indicator for engagement_z above 0.5."},
        {"column": "price_sensitive_flag", "role": "Effect-modifier basis feature", "description": "Indicator for price_sensitivity_z above 0.5."},
        {"column": "true_tau", "role": "Oracle diagnostic", "description": "True individual treatment effect from the simulator; unavailable in real data."},
        {"column": "segment", "role": "GATE group", "description": "Mutually exclusive teaching segment used for group effect estimation."},
        {"column": "outcome", "role": "Outcome", "description": "Observed continuous outcome after treatment assignment."},
    ]
)
save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
display(field_dictionary)
                  column                                    role  description
0           engagement_z      Observed control / effect modifier  Standardized prior engagement signal.
1               tenure_z                        Observed control  Standardized account age or relationship length signal.
2    price_sensitivity_z      Observed control / effect modifier  Standardized signal for sensitivity to cost, friction, or effort.
3    category_affinity_z      Observed control / effect modifier  Standardized affinity for the treatment-relevant category.
4          prior_value_z  Observed control / interaction feature  Standardized prior value or historical outcome signal.
5         support_load_z                        Observed control  Standardized friction, complaint, or support burden signal.
6        propensity_true                       Oracle diagnostic  True treatment probability used by the simulator; not used by DoubleML.
7              treatment                               Treatment  Binary treatment/exposure indicator.
8   high_engagement_flag           Effect-modifier basis feature  Indicator for engagement_z above 0.5.
9   price_sensitive_flag           Effect-modifier basis feature  Indicator for price_sensitivity_z above 0.5.
10              true_tau                       Oracle diagnostic  True individual treatment effect from the simulator; unavailable in real data.
11               segment                              GATE group  Mutually exclusive teaching segment used for group effect estimation.
12               outcome                                 Outcome  Observed continuous outcome after treatment assignment.
The dictionary separates columns used for modeling from oracle-only columns. That separation matters: using true_tau or propensity_true in the learner would turn the tutorial into leakage rather than causal estimation.
Basic Data Audit
The audit below checks sample size, treatment rate, propensity overlap, outcome scale, and the oracle average effect. These numbers are the first sanity check before any heterogeneity claim.
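The original audit cell is not reproduced here, so the sketch below shows one way to compute those numbers. In the notebook the call would be `audit_hte_frame(hte_df)`; the small stand-in frame exists only so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

def audit_hte_frame(frame):
    """Basic audit: sample size, treatment rate, overlap range, outcome scale, oracle ATE."""
    return pd.Series(
        {
            "n_obs": len(frame),
            "treatment_rate": frame["treatment"].mean(),
            "propensity_min": frame["propensity_true"].min(),
            "propensity_max": frame["propensity_true"].max(),
            "outcome_mean": frame["outcome"].mean(),
            "outcome_std": frame["outcome"].std(),
            "oracle_ate": frame["true_tau"].mean(),
        }
    )

# Stand-in frame with the same columns the audit needs from hte_df.
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {
        "propensity_true": rng.uniform(0.1, 0.9, 200),
        "outcome": rng.normal(2.0, 1.0, 200),
        "true_tau": rng.normal(0.45, 0.2, 200),
    }
)
demo["treatment"] = rng.binomial(1, demo["propensity_true"])
print(audit_hte_frame(demo))
```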
The treatment rate and propensity range indicate that the simulated design has overlap. Overlap is crucial: if a subgroup is almost always treated or almost never treated, its effect estimate becomes heavily model-dependent.
Segment Audit
GATE estimation needs groups with enough observations and both treatment states represented. This cell summarizes the mutually exclusive teaching segments before asking DoubleML for group effects.
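A sketch of that segment summary, assuming a frame shaped like `hte_df` (the random stand-in below only keeps the sketch self-contained): group size, treatment rate, oracle group effect, and a check that both treatment arms appear in every group.

```python
import numpy as np
import pandas as pd

def audit_segments(frame):
    """Check each GATE group has enough rows and both treatment arms represented."""
    summary = frame.groupby("segment").agg(
        n=("treatment", "size"),
        treatment_rate=("treatment", "mean"),
        oracle_gate=("true_tau", "mean"),
    )
    summary["both_arms"] = frame.groupby("segment")["treatment"].nunique() == 2
    return summary.reset_index()

# Tiny stand-in; the notebook would call audit_segments(hte_df).
rng = np.random.default_rng(2)
demo = pd.DataFrame(
    {
        "segment": rng.choice(["general", "high_engagement", "price_sensitive"], 300),
        "treatment": rng.binomial(1, 0.5, 300),
        "true_tau": rng.normal(0.45, 0.2, 300),
    }
)
print(audit_segments(demo))
```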
The oracle column shows the pattern the estimator should recover: high-engagement users have larger effects, while price-sensitive users have smaller effects. In real data this column would not exist, so we would rely on design credibility, uncertainty, and validation checks instead.
Teaching Design Diagram
The diagram below summarizes the causal structure used in this notebook. Observed controls affect treatment assignment and baseline outcomes. Some observed controls also modify the treatment effect. DoubleML estimates nuisance functions for treatment assignment and outcomes, then uses orthogonal scores to estimate causal effects and heterogeneity summaries.
fig, ax = plt.subplots(figsize=(13, 6.5))
ax.set_axis_off()

nodes = {
    "X": {"xy": (0.10, 0.55), "label": "Observed\ncontrols X", "color": "#dbeafe"},
    "N": {"xy": (0.38, 0.78), "label": "Nuisance\nmodels", "color": "#ede9fe"},
    "D": {"xy": (0.40, 0.34), "label": "Treatment\nD", "color": "#fef3c7"},
    "T": {"xy": (0.68, 0.78), "label": "Effect\nmodifiers", "color": "#dcfce7"},
    "Y": {"xy": (0.88, 0.44), "label": "Outcome\nY", "color": "#fee2e2"},
    "H": {"xy": (0.68, 0.22), "label": "GATE/CATE\nsummary", "color": "#e0f2fe"},
}
box_w, box_h = 0.16, 0.12

def anchor(node, side):
    x, y = nodes[node]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
        "upper_right": (box_w / 2, box_h * 0.25),
        "lower_right": (box_w / 2, -box_h * 0.25),
        "upper_left": (-box_w / 2, box_h * 0.25),
        "lower_left": (-box_w / 2, -box_h * 0.25),
    }
    dx, dy = offsets[side]
    return np.array([x + dx, y + dy], dtype=float)

def shorten(start, end, gap=0.018):
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    delta = end - start
    length = np.hypot(delta[0], delta[1])
    if length == 0:
        return tuple(start), tuple(end)
    unit = delta / length
    return tuple(start + gap * unit), tuple(end - gap * unit)

def draw_arrow(start, end, color="#334155", style="solid", rad=0.0, linewidth=1.7):
    start, end = shorten(start, end)
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=linewidth,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        zorder=2,
    )
    ax.add_patch(arrow)

# Draw arrows before boxes so line endpoints tuck under the box borders cleanly.
draw_arrow(anchor("X", "upper_right"), anchor("N", "left"), color="#475569", rad=0.03)
draw_arrow(anchor("X", "lower_right"), anchor("D", "left"), color="#475569", rad=-0.04)
draw_arrow(anchor("X", "right"), anchor("Y", "left"), color="#475569", rad=-0.12)
draw_arrow(anchor("X", "upper_right"), anchor("T", "left"), color="#15803d", rad=0.08)
draw_arrow(anchor("D", "upper_right"), anchor("Y", "lower_left"), color="#b45309", rad=0.00)
draw_arrow(anchor("T", "bottom"), anchor("Y", "top"), color="#15803d", style="dashed", rad=-0.04)
draw_arrow(anchor("N", "lower_right"), anchor("H", "upper_left"), color="#7c3aed", rad=-0.05)
draw_arrow(anchor("T", "bottom"), anchor("H", "top"), color="#15803d", rad=0.04)

for spec in nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

ax.text(
    0.50,
    0.08,
    "Heterogeneity summaries sit on top of the same identification assumptions as the original causal design.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Teaching Design For Heterogeneous Treatment Effects", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_hte_teaching_design.png", dpi=160, bbox_inches="tight")
plt.show()
The dashed arrow from effect modifiers to the outcome represents effect modification: the causal effect of D changes with X. The GATE/CATE summary is downstream of the fitted orthogonal scores, not a separate identification strategy.
Fit The Baseline DoubleMLIRM Model
Because the treatment is binary and the outcome is continuous, this notebook uses DoubleMLIRM. The nuisance models are:
ml_g: outcome regressions for the treated and untreated potential outcome surfaces.
ml_m: the propensity model for treatment assignment.
The causal target here is the ATE. GATE and CATE/BLP summaries are then built from the fitted object.
The ATE estimate is the anchor. If the global effect estimate were unstable or inconsistent with the design, it would be premature to tell subgroup stories. Here the estimate is close enough to the oracle average effect for the synthetic tutorial to proceed.
Nuisance-Model Diagnostics
Heterogeneity summaries inherit the quality of the nuisance models. The next cell computes simple diagnostics from DoubleML’s cross-fitted predictions: outcome RMSE for ml_g0 and ml_g1, propensity log loss, propensity Brier score, and the estimated propensity range.
These are not causal diagnostics by themselves. They are checks that the machine-learning pieces are behaving sensibly before we use their orthogonal scores for causal summaries.
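Since the diagnostics cell is not shown here, the sketch below reproduces the same idea with scikit-learn's cross-fitted predictions on a stand-in dataset. For brevity it fits one outcome model with the treatment as a feature, whereas DoubleML fits separate g0/g1 surfaces; treat it as an approximation of the notebook's diagnostics, not a replica.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import brier_score_loss, log_loss, mean_squared_error
from sklearn.model_selection import cross_val_predict

# Stand-in data; the notebook would reuse hte_df and x_cols.
rng = np.random.default_rng(4)
n = 800
X = rng.normal(size=(n, 4))
p = 1.0 / (1.0 + np.exp(-0.8 * X[:, 0]))
d = rng.binomial(1, p)
y = X[:, 0] + 0.5 * d + rng.normal(0, 0.9, n)

# Cross-fitted propensity predictions mimic the ml_m diagnostics.
m_hat = cross_val_predict(
    RandomForestClassifier(n_estimators=200, min_samples_leaf=20, random_state=0),
    X, d, cv=5, method="predict_proba",
)[:, 1]
# Cross-fitted outcome predictions mimic the ml_g diagnostics.
g_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0),
    np.column_stack([X, d]), y, cv=5,
)

diagnostics = pd.Series(
    {
        "propensity_log_loss": log_loss(d, m_hat),
        "propensity_brier": brier_score_loss(d, m_hat),
        "propensity_min": m_hat.min(),
        "propensity_max": m_hat.max(),
        "outcome_rmse": float(np.sqrt(mean_squared_error(y, g_hat))),
    }
)
print(diagnostics)
```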
The propensity scores stay away from zero and one, which supports the overlap needed for subgroup analysis. The outcome RMSE values are also reasonable relative to the noise scale used in the simulator.
Propensity Overlap Plot
A visual overlap check is often easier to read than a table. If treated and untreated units occupied disjoint regions of estimated propensity, subgroup estimates would lean heavily on extrapolation.
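The plotting cell is not reproduced here; a minimal version looks like the sketch below. The simulated propensities stand in for the model's cross-fitted `ml_m` predictions, and the headless backend is only needed outside Jupyter.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt

# Stand-in cross-fitted propensities; the notebook would take them from the fitted model.
rng = np.random.default_rng(5)
m_hat = np.clip(rng.beta(4, 4, 1000), 0.02, 0.98)
treated = rng.binomial(1, m_hat).astype(bool)

fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(m_hat[treated], bins=30, alpha=0.6, density=True, label="treated")
ax.hist(m_hat[~treated], bins=30, alpha=0.6, density=True, label="untreated")
ax.set_xlabel("estimated propensity")
ax.set_ylabel("density")
ax.set_title("Propensity Overlap Check")
ax.legend()
```

If the two histograms barely share support, subgroup estimates in the thin regions rest on extrapolation rather than data.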
The two distributions overlap across most of the range. That does not prove the causal design is correct, but it avoids the most obvious positivity failure for the tutorial example.
Oracle Effect Distribution
Before estimating heterogeneity, we look at the simulated true_tau distribution. This is a teaching luxury: it shows what kind of variation the GATE and CATE summaries are trying to compress.
The distribution is not a point mass, so an average effect hides real variation. The rest of the notebook asks how much of that variation can be summarized with groups and basis features.
Estimate GATEs With DoubleML
GATE estimation requires groups that are meaningful and sufficiently populated. The groups here are deliberately simple and mutually exclusive: high_engagement, price_sensitive, and general. DoubleML’s gate() method returns a DoubleMLBLP object, which provides coefficient summaries and confidence intervals.
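The idea behind gate() can be seen in a hand-rolled sketch: average an AIPW (orthogonal) pseudo-outcome within each group. The code below uses oracle nuisances on simulated data to keep it short; DoubleML replaces them with cross-fitted ML fits and adds valid inference, so this is the intuition, not the library call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 4000
x = rng.normal(size=n)
segment = np.where(x > 0.7, "high", "general")
m = 1.0 / (1.0 + np.exp(-0.5 * x))      # true propensity (oracle here)
d = rng.binomial(1, m)
tau = 0.45 + 0.3 * (x > 0.7)            # true group-varying effect
y = x + tau * d + rng.normal(0, 0.5, n)

# Oracle nuisances keep the sketch short; in practice these are cross-fitted ML fits.
g0, g1 = x, x + tau
psi = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)  # AIPW pseudo-outcome

gate = pd.DataFrame({"segment": segment, "psi": psi}).groupby("segment")["psi"].agg(["mean", "sem"])
print(gate)
```

Averaging psi over the whole sample gives the ATE; averaging it within groups gives the GATEs, which is why group estimates inherit the orthogonality of the base score.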
The estimated group ranking follows the synthetic design: the high-engagement group benefits most, while the price-sensitive group has the smallest effect. Notice that the confidence intervals still matter; GATE estimates are summaries with uncertainty, not labels permanently attached to users.
Plot Estimated GATEs Against Oracle Group Effects
This plot compares estimated group effects with the oracle group means. In real data, the blue oracle markers would be absent; they are included here only to make the tutorial measurable.
The plot makes the reporting task clearer than a table alone. The high-engagement segment is above the ATE, the price-sensitive segment is below it, and the uncertainty bars show how much precision we actually have.
CATE As A Best Linear Predictor
DoubleML’s cate() method estimates a best linear predictor of the heterogeneous treatment effect signal on a basis supplied by the analyst. This is why basis design is not a cosmetic choice. The basis defines the heterogeneity language the model is allowed to speak.
The basis below includes continuous terms, threshold indicators, and one interaction. It is intentionally close to the synthetic data-generating process, but not identical enough to be trivial.
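Mechanically, the BLP step is an OLS projection of the orthogonal effect signal onto the basis. The sketch below shows just that projection on simulated data, with a noisy signal standing in for the AIPW scores; DoubleML's cate() adds the proper uncertainty handling on top.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000
eng = rng.normal(size=n)
price = rng.normal(size=n)
tau = 0.45 + 0.3 * (eng > 0.5) - 0.2 * (price > 0.5)
psi = tau + rng.normal(0, 1.0, n)  # noisy effect signal standing in for AIPW scores

basis = pd.DataFrame(
    {
        "intercept": 1.0,
        "high_engagement_flag": (eng > 0.5).astype(float),
        "price_sensitive_flag": (price > 0.5).astype(float),
    }
)
coefs, *_ = np.linalg.lstsq(basis.to_numpy(), psi, rcond=None)
print(pd.Series(coefs, index=basis.columns))
```

The recovered coefficients approximate the simulated pattern (roughly 0.45, +0.3, -0.2) because the basis happens to span the true effect; with a misspecified basis the projection is still well-defined, but it summarizes rather than recovers the CATE.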
The coefficient signs line up with the simulator: high engagement is associated with larger treatment effects, price sensitivity is associated with smaller treatment effects, and category affinity contributes positively. The coefficient table should be read as a projection summary, not as proof that the true CATE is exactly linear.
Plot The BLP Coefficients
The coefficient plot is a compact way to communicate which basis features are associated with heterogeneity. We exclude the intercept so the plot focuses on variation around the average.
The coefficient plot is useful for screening patterns, but it should not replace substantive judgment. If the basis features were chosen after many failed attempts, the uncertainty shown here would understate the total search process.
CATE Profile Over Engagement
A BLP can also be evaluated on a grid of profiles. Here we vary engagement while holding price sensitivity and category affinity at zero and prior value at 0.8. This creates a simple curve showing how the fitted heterogeneity summary changes with engagement.
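The grid evaluation amounts to building basis rows for each profile and multiplying by the fitted coefficients. The sketch below uses illustrative stand-in coefficients and column names, not the notebook's estimated BLP.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in coefficients for a fitted linear heterogeneity summary.
coefs = pd.Series(
    {
        "intercept": 0.45,
        "engagement_z": 0.05,
        "high_engagement_flag": 0.30,
        "price_sensitivity_z": 0.0,
        "engagement_x_prior": 0.10,
    }
)
grid = pd.DataFrame({"engagement_z": np.linspace(-2, 2, 9)})
profile = pd.DataFrame(
    {
        "intercept": 1.0,
        "engagement_z": grid["engagement_z"],
        "high_engagement_flag": (grid["engagement_z"] > 0.5).astype(float),
        "price_sensitivity_z": 0.0,                       # held at zero
        "engagement_x_prior": grid["engagement_z"] * 0.8, # prior value held at 0.8
    }
)
grid["predicted_effect"] = profile.to_numpy() @ coefs.loc[profile.columns].to_numpy()
print(grid)
```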
The profile table turns the CATE/BLP object into predicted effects for concrete covariate profiles. This is often easier for stakeholders to understand than a coefficient table, as long as we make clear that the curve is model-based.
Plot The CATE Profile
The confidence band below is a joint band from the BLP object. Joint bands are wider than pointwise intervals because they account for looking across the whole profile grid.
The estimated profile captures the main upward shift after the high-engagement threshold, though the band reminds us that profile-level heterogeneity is less precise than the global ATE.
CATE Profile Over Price Sensitivity
The previous profile varied a positive modifier. This profile varies price sensitivity, which the simulator makes a negative effect modifier. Looking at both directions helps prevent a one-sided heterogeneity story.
The profile table shows the estimated effect declining as price sensitivity rises. The sharp oracle drop appears at the threshold because the synthetic data-generating process included a threshold indicator.
Plot The Price-Sensitivity Profile
This plot is a useful contrast with the engagement profile: one modifier increases treatment value, while another reduces it.
This pattern is exactly the kind of result that needs careful translation. A lower estimated effect for a subgroup does not automatically mean the subgroup should be excluded; it means the treatment may need a different design, message, cost, or eligibility rule.
A Cautionary Subgroup Drill
Subgroup analysis is vulnerable to over-searching. To make that concrete, the next cell estimates GATEs for tenure terciles. In the simulator, tenure affects baseline outcomes but is not a direct effect modifier. If we still see variation, that variation is a reminder that subgroup estimates can move around even when the causal effect is not designed to differ much by that grouping.
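Building the tercile groups is a one-liner with pandas; the sketch below shows the group construction on a stand-in frame (in the notebook this would use hte_df["tenure_z"]). The resulting dummy matrix, one column per mutually exclusive group, is the shape of input DoubleML's group-effect machinery expects.

```python
import numpy as np
import pandas as pd

# Mutually exclusive tenure terciles for the cautionary GATE run.
rng = np.random.default_rng(8)
demo = pd.DataFrame({"tenure_z": rng.normal(size=900)})
demo["tenure_band"] = pd.qcut(demo["tenure_z"], q=3, labels=["low", "mid", "high"])

# One dummy column per group, rows summing to exactly one.
group_dummies = pd.get_dummies(demo["tenure_band"], dtype=float)
print(group_dummies.sum())
```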
The tenure groups have estimated differences, but the oracle group effects are much closer together than the main engagement and price-sensitive segments. This is why subgroup analysis should start with theory and pre-specified segmentation, not a search over every available column.
Plot The Cautionary Tenure GATEs
The tenure plot is deliberately included as a counterweight to the more convincing segment analysis above. Strong causal notebooks should show what not to overstate.
The tenure estimates are not useless, but they are weaker evidence than the pre-designed effect-modifier analysis. This is the kind of distinction that makes a heterogeneity report credible.
Candidate Reporting Table
A practical heterogeneity report should be short, explicit, and uncertainty-aware. The table below combines the strongest pre-defined GATE results with a cautious recommendation field. The recommendation text is intentionally conservative: causal heterogeneity can guide follow-up work, but it should not be oversold as automatic personalization.
reporting_table = gate_results.copy()
reporting_table["estimated_lift_vs_ate"] = reporting_table["effect"] - float(irm_model.coef[0])
reporting_table["reporting_note"] = np.select(
    [
        reporting_table["ci_lower"] > float(irm_model.coef[0]),
        reporting_table["ci_upper"] < float(irm_model.coef[0]),
    ],
    [
        "Evidence that this group has above-average effect; validate before targeting.",
        "Evidence that this group has below-average effect; examine treatment fit and harms.",
    ],
    default="Effect overlaps the ATE; report as uncertain rather than decisive.",
)
reporting_table = reporting_table[
    [
        "segment",
        "n",
        "treatment_rate",
        "effect",
        "ci_lower",
        "ci_upper",
        "estimated_lift_vs_ate",
        "true_gate",
        "reporting_note",
    ]
]
save_table(reporting_table, f"{NOTEBOOK_PREFIX}_candidate_reporting_table.csv")
display(reporting_table)
           segment    n  treatment_rate  effect  ci_lower  ci_upper  estimated_lift_vs_ate  true_gate  reporting_note
1  high_engagement  372          0.5806  0.7856    0.3878    1.1834                 0.2559     0.7252  Effect overlaps the ATE; report as uncertain rather than decisive.
0          general  910          0.4813  0.5513    0.2838    0.8189                 0.0216     0.4592  Effect overlaps the ATE; report as uncertain rather than decisive.
2  price_sensitive  318          0.3428  0.1684   -0.2843    0.6211                -0.3613     0.2529  Effect overlaps the ATE; report as uncertain rather than decisive.
The reporting table shows how to move from estimation to communication without pretending the estimates are perfect. The note column is not generated by DoubleML; it is analyst judgment layered on top of estimates and uncertainty.
Practical Reporting Checklist
This checklist records what should accompany any GATE/CATE result in a serious analysis. The core habit is simple: every heterogeneity claim should say what was pre-defined, what was estimated, how uncertainty was handled, and what assumptions remain.
reporting_checklist = pd.DataFrame(
    [
        {"item": "Identification", "guidance": "State the unconfoundedness, overlap, and SUTVA assumptions used by the base DoubleML design."},
        {"item": "Group definition", "guidance": "Explain whether GATE groups were pre-specified and confirm they are sufficiently populated."},
        {"item": "Basis definition", "guidance": "List all CATE/BLP basis features and why they were chosen."},
        {"item": "Uncertainty", "guidance": "Report confidence intervals or bands, not only point estimates."},
        {"item": "Multiple comparisons", "guidance": "Avoid searching many subgroups without correction or validation."},
        {"item": "Operational use", "guidance": "Treat heterogeneity as prioritization evidence, not an automatic deployment rule."},
        {"item": "Robustness", "guidance": "Check whether the pattern survives alternative learners, sample splits, and segmentation choices."},
    ]
)
save_table(reporting_checklist, f"{NOTEBOOK_PREFIX}_reporting_checklist.csv")
display(reporting_checklist)
                   item  guidance
0        Identification  State the unconfoundedness, overlap, and SUTVA assumptions used by the base DoubleML design.
1      Group definition  Explain whether GATE groups were pre-specified and confirm they are sufficiently populated.
2      Basis definition  List all CATE/BLP basis features and why they were chosen.
3           Uncertainty  Report confidence intervals or bands, not only point estimates.
4  Multiple comparisons  Avoid searching many subgroups without correction or validation.
5       Operational use  Treat heterogeneity as prioritization evidence, not an automatic deployment rule.
6            Robustness  Check whether the pattern survives alternative learners, sample splits, and segmentation choices.
This checklist is intentionally not code-heavy. Heterogeneity work fails as often in framing and communication as it does in estimation.
Write A Reusable Report Template
The final report template captures the narrative structure created by this notebook. It is saved as Markdown so it can be reused outside Jupyter.
report_template = f"""# Heterogeneous Treatment Effects Report Template

## Causal Question
State the treatment, outcome, population, and time horizon. Explain why an average effect is not enough for the decision.

## Identification Design
Describe the assumptions behind the base DoubleML model: observed confounding adjustment, overlap, and no interference. Note that GATE and CATE summaries inherit these assumptions.

## Base ATE
- Estimated ATE: {float(irm_model.coef[0]):.4f}
- Standard error: {float(irm_model.se[0]):.4f}
- 95% confidence interval: [{float(irm_model.confint().iloc[0, 0]):.4f}, {float(irm_model.confint().iloc[0, 1]):.4f}]

## Pre-Specified GATEs
Report group definitions, group sizes, treatment rates, estimated effects, and confidence intervals. Avoid ranking groups without uncertainty.

## CATE/BLP Summary
List the basis features used for the BLP. Explain that coefficients summarize a projection of the heterogeneous effect signal, not a full structural model of individual treatment effects.

## Subgroup Cautions
Document any exploratory groupings separately from pre-specified groupings. Treat exploratory findings as hypotheses for validation.

## Decision Guidance
Translate the heterogeneity pattern into follow-up experiments, monitoring plans, or product design changes. Avoid direct targeting decisions unless the design and validation are strong enough.
"""
report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_hte_report_template.md"
report_path.write_text(report_template)
print(f"Wrote report template to: {report_path}")
Wrote report template to: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/14_hte_report_template.md
The saved template is deliberately conservative. It keeps the distinction between estimation, validation, and action visible.
Artifact Manifest
The manifest below lists the main artifacts created by this notebook. Keeping a manifest makes it easier to review outputs without reopening every cell.
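The manifest cell is not reproduced here; a minimal version walks the output tree and records every file carrying the notebook prefix. In the notebook the call would be `build_manifest(OUTPUT_DIR, NOTEBOOK_PREFIX)`; the throwaway temp folder below only makes the sketch runnable anywhere.

```python
import tempfile
from pathlib import Path
import pandas as pd

def build_manifest(output_dir, prefix):
    """List every saved artifact for this notebook prefix under the output tree."""
    rows = [
        {"artifact": path.name, "folder": path.parent.name, "kind": path.suffix.lstrip(".")}
        for path in sorted(Path(output_dir).rglob(f"{prefix}_*"))
        if path.is_file()
    ]
    return pd.DataFrame(rows)

# Demo on a throwaway folder with one fake artifact.
tmp = Path(tempfile.mkdtemp())
(tmp / "tables").mkdir()
(tmp / "tables" / "14_demo_table.csv").write_text("a,b\n1,2\n")
print(build_manifest(tmp, "14"))
```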
The artifacts make the notebook reviewable as a small analysis package: data, tables, figures, and a report template all live in predictable locations.
What Comes Next
This notebook covered heterogeneity summaries after fitting a valid DoubleML model. The next natural tutorial is policy learning and weighted effects: how to turn estimated effect patterns into cautious decision rules, value estimates, quantiles, and downside-risk summaries.
The main lesson to carry forward is that heterogeneity analysis is not just a better plot. It is a higher-risk causal claim because it asks for more detail from the same data. Use GATE and CATE/BLP to structure evidence, not to bypass design assumptions.