The previous notebooks focused mostly on average treatment effects. An average effect answers a valuable question: “what is the mean causal effect of treatment in this population?” But many practical decisions need a more granular answer: “for which kinds of units is the effect larger, smaller, or possibly harmful?”
That second question is about conditional average treatment effects, usually shortened to CATE. A CATE is an average treatment effect conditional on some unit characteristics, such as baseline intent, history, segment, risk score, geography, or prior engagement.
This notebook teaches CATE through a controlled example where the true treatment effect is known. We will use DoWhy for the graph, identification, average effect estimation, treatment-effect modifiers, and an EconML-backed causal forest. Along the way, we will keep the core discipline from earlier notebooks: heterogeneous effects are still causal effects, so they only make sense after the causal design is credible.
Learning Goals
By the end of this notebook, you should be able to:
Explain the difference between an ATE and a CATE.
Distinguish confounders from effect modifiers.
Simulate a setting where treatment effects vary across users.
Estimate an average effect with DoWhy as the baseline target.
Add effect modifiers to a DoWhy linear regression estimator.
Produce segment-level and bucket-level CATE summaries.
Estimate more flexible CATEs with DoWhy’s EconML integration.
Evaluate CATE models with calibration-style diagnostics when ground truth is available.
Explain why overlap and causal identification still matter when modeling heterogeneity.
ATE Versus CATE
The average treatment effect is a population-level summary:
[ ATE = E[Y(1) - Y(0)] ]
A conditional average treatment effect is the same causal contrast, but averaged inside a subgroup or at a feature value:
[ CATE(x) = E[Y(1) - Y(0) | X = x] ]
The CATE is not automatically more useful than the ATE. It is more detailed, which means it can also be noisier and easier to overfit. The right mental model is:
ATE: the best single-number causal summary for the population.
CATE: how that causal effect changes across observed characteristics.
ITE prediction: an estimated unit-level effect score, usually produced by a model, that should be summarized and validated rather than treated as exact truth.
This notebook moves from the ATE to increasingly detailed CATE views.
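To make the contrast concrete, here is a tiny simulation (made-up numbers, separate from the notebook's dataset) where both potential outcomes are visible, so the ATE and subgroup CATEs can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.integers(0, 2, n)             # a binary characteristic, e.g. a segment flag
y0 = rng.normal(0.0, 1.0, n)          # potential outcome without treatment
y1 = y0 + 0.5 + 0.8 * x               # treatment adds 0.5, plus 0.8 more when x == 1

ate = (y1 - y0).mean()                # E[Y(1) - Y(0)]
cate_x0 = (y1 - y0)[x == 0].mean()    # E[Y(1) - Y(0) | X = 0]
cate_x1 = (y1 - y0)[x == 1].mean()    # E[Y(1) - Y(0) | X = 1]
print(f"ATE = {ate:.2f}, CATE(X=0) = {cate_x0:.2f}, CATE(X=1) = {cate_x1:.2f}")
```

In real data only one of y0 and y1 is ever observed per unit, which is why both quantities must be estimated rather than averaged directly.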
Confounders And Effect Modifiers
A variable can play different causal roles.
A confounder affects treatment and outcome. It must be handled for identification. If we ignore an important confounder, the causal effect estimate can be biased.
An effect modifier changes the size of the treatment effect. It helps answer heterogeneity questions. Effect modifiers do not usually change whether the average effect is identifiable, but they change how we model and report the effect.
The same variable can be both. For example, baseline intent can affect whether a user gets exposed to a recommendation and also change how strongly exposure affects future value. In that case, it belongs in the adjustment set and in the heterogeneity model.
Setup
This cell imports the tutorial dependencies, suppresses known non-actionable library warnings, creates output folders, and sets a stable plotting style. The MPLCONFIGDIR setting avoids notebook-execution warnings in environments where the default Matplotlib cache directory is not writable.
```python
from pathlib import Path

import os
import warnings

# Keep Matplotlib quiet in sandboxed or shared environments where the default cache path may not be writable.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib-ranking-sys")

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*disp.*iprint.*L-BFGS-B.*")
warnings.filterwarnings("ignore", message=".*variables are assumed unobserved.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="seaborn.categorical")
warnings.filterwarnings("ignore", module="pydot.dot_parser")
warnings.filterwarnings("ignore", module="econml")

import dowhy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from dowhy import CausalModel
from econml.dml import CausalForestDML
from IPython.display import display
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 150)
pd.set_option("display.float_format", "{:.4f}".format)
sns.set_theme(style="whitegrid", context="notebook")

for candidate in [Path.cwd(), *Path.cwd().parents]:
    if (candidate / "notebooks" / "tutorials" / "dowhy").exists():
        PROJECT_ROOT = candidate
        break
else:
    PROJECT_ROOT = Path.cwd()

NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "dowhy"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

RNG = np.random.default_rng(707)

print(f"DoWhy version: {dowhy.__version__}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Figure output directory: {FIGURE_DIR}")
print(f"Table output directory: {TABLE_DIR}")
```
The notebook is ready once the DoWhy version and output folder paths print. All generated outputs from this notebook use a 07_ prefix.
Strategy Map For Heterogeneous Effects
CATE analysis is not one estimator. It is a workflow. The table below lays out the sequence we will follow: start with identification, estimate the average effect, then ask how that effect varies across meaningful features.
```python
cate_strategy_map = pd.DataFrame(
    [
        {
            "step": "Define the causal target",
            "question": "What treatment and outcome are being contrasted?",
            "why it matters": "Heterogeneity is still about potential outcomes, not prediction alone.",
        },
        {
            "step": "Identify the average effect",
            "question": "Which variables block backdoor paths?",
            "why it matters": "CATE modeling cannot repair a broken causal design.",
        },
        {
            "step": "Choose effect modifiers",
            "question": "Which pre-treatment features might change the effect size?",
            "why it matters": "Effect modifiers should be known before treatment and meaningful to explain.",
        },
        {
            "step": "Estimate subgroup effects",
            "question": "How does the effect differ by segment or feature bucket?",
            "why it matters": "Grouped estimates are easier to audit than individual scores.",
        },
        {
            "step": "Fit flexible CATE models",
            "question": "Can a richer model capture nonlinear effect patterns?",
            "why it matters": "Flexible CATE models can help targeting, but they need validation.",
        },
        {
            "step": "Check overlap and stability",
            "question": "Do treated and untreated units exist in each subgroup?",
            "why it matters": "No model can estimate a contrast where the data contain no contrast.",
        },
    ]
)
cate_strategy_map.to_csv(TABLE_DIR / "07_cate_strategy_map.csv", index=False)
display(cate_strategy_map)
```
| step | question | why it matters |
| --- | --- | --- |
| Define the causal target | What treatment and outcome are being contrasted? | Heterogeneity is still about potential outcomes, not prediction alone. |
| Identify the average effect | Which variables block backdoor paths? | CATE modeling cannot repair a broken causal design. |
| Choose effect modifiers | Which pre-treatment features might change the effect size? | Effect modifiers should be known before treatment and meaningful to explain. |
| Estimate subgroup effects | How does the effect differ by segment or feature bucket? | Grouped estimates are easier to audit than individual scores. |
| Fit flexible CATE models | Can a richer model capture nonlinear effect patterns? | Flexible CATE models can help targeting, but they need validation. |
| Check overlap and stability | Do treated and untreated units exist in each subgroup? | No model can estimate a contrast where the data contain no contrast. |
The workflow starts with causal design, not with a model leaderboard. That order keeps the notebook anchored: a sophisticated CATE model is only useful if the effect being modeled is identified.
Simulate A Heterogeneous Treatment Effect Dataset
We will simulate a binary treatment called recommendation_exposure and a continuous outcome called weekly_value. The treatment effect is deliberately heterogeneous:
higher for users in the power_segment,
higher for users with stronger baseline_intent,
mildly nonlinear for very high-intent users.
The treatment assignment is confounded because baseline intent, segment, and seasonality all affect the probability of exposure and the outcome. This lets us practice adjustment and heterogeneity modeling in one example.
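The generating cell itself is summarized by the printout below. As a minimal sketch, a data-generating process with this shape could look like the following (illustrative coefficient values and a `sim_df` name chosen for this sketch, not the notebook's exact cell):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(707)
n = 5_000

# Pre-treatment covariates.
baseline_intent = rng.normal(0.0, 1.0, n)
power_segment = rng.integers(0, 2, n)
seasonality_score = rng.normal(0.0, 1.0, n)
prior_value = baseline_intent + rng.normal(0.0, 1.0, n)

# Confounded assignment: intent, segment, and seasonality all shift exposure.
logit = 0.3 + 0.8 * baseline_intent + 0.5 * power_segment + 0.4 * seasonality_score
treatment_probability = 1.0 / (1.0 + np.exp(-logit))
recommendation_exposure = rng.binomial(1, treatment_probability)

# Heterogeneous effect: larger for the power segment and for high intent,
# with a mild nonlinear bend for very high-intent users.
true_cate = (
    0.5
    + 0.4 * power_segment
    + 0.5 * baseline_intent
    + 0.15 * np.maximum(baseline_intent - 1.0, 0.0) ** 2
)

weekly_value = (
    1.0
    + 1.5 * baseline_intent
    + 0.8 * power_segment
    + 0.6 * seasonality_score
    + 0.5 * prior_value
    + true_cate * recommendation_exposure
    + rng.normal(0.0, 1.0, n)
)

sim_df = pd.DataFrame(
    {
        "recommendation_exposure": recommendation_exposure,
        "weekly_value": weekly_value,
        "baseline_intent": baseline_intent,
        "power_segment": power_segment,
        "seasonality_score": seasonality_score,
        "prior_value": prior_value,
        "treatment_probability": treatment_probability,
        "true_cate": true_cate,
        "segment_label": np.where(power_segment == 1, "power", "standard"),
    }
)
print(f"Rows: {len(sim_df):,}")
print(f"Observed exposure rate: {sim_df['recommendation_exposure'].mean():.3f}")
```

The key design choice is that `true_cate` multiplies the treatment indicator in the outcome equation, so heterogeneity enters only through the treatment contrast, while the confounders also enter the outcome directly.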
Rows: 5,000
Observed exposure rate: 0.575
True ATE from the data-generating process: 0.882
|   | recommendation_exposure | weekly_value | baseline_intent | power_segment | seasonality_score | prior_value | treatment_probability | true_cate | segment_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 5.1091 | -0.1408 | 1 | -0.6061 | 3.3615 | 0.6299 | 1.0367 | power |
| 1 | 0 | -2.3683 | -2.6585 | 0 | -1.3053 | -2.1878 | 0.0593 | -0.6463 | standard |
| 2 | 0 | 8.5559 | 1.8354 | 0 | 1.5331 | 4.0925 | 0.8947 | 1.8348 | standard |
| 3 | 0 | 6.5446 | -0.6957 | 1 | -0.0646 | 1.6789 | 0.5077 | 0.7869 | power |
| 4 | 0 | 1.6414 | -0.7046 | 0 | -1.2943 | -0.0873 | 0.2475 | 0.2329 | standard |
The dataframe includes true_cate, which is available only because this is a simulation. In a real dataset, that column would not exist. We keep it here so the tutorial can evaluate whether the CATE estimates are learning the intended effect pattern.
Data Field Guide
The next table documents every column used in the simulation. This is good practice for CATE work because variables can play more than one role, and confusing a confounder with an effect modifier can lead to muddled analysis.
```python
field_guide = pd.DataFrame(
    [
        {
            "column": "recommendation_exposure",
            "role": "treatment",
            "description": "Binary indicator for whether the unit received the recommendation exposure.",
        },
        {
            "column": "weekly_value",
            "role": "outcome",
            "description": "Continuous post-treatment value outcome.",
        },
        {
            "column": "baseline_intent",
            "role": "confounder and effect modifier",
            "description": "Pre-treatment intent score that affects exposure, outcome, and treatment-effect size.",
        },
        {
            "column": "power_segment",
            "role": "confounder and effect modifier",
            "description": "Binary pre-treatment segment indicator that affects exposure, outcome, and treatment-effect size.",
        },
        {
            "column": "seasonality_score",
            "role": "confounder",
            "description": "Pre-treatment timing score that affects exposure and outcome but not the treatment effect directly.",
        },
        {
            "column": "prior_value",
            "role": "confounder",
            "description": "Pre-treatment value summary that affects exposure and outcome.",
        },
        {
            "column": "treatment_probability",
            "role": "known simulation diagnostic",
            "description": "The true treatment probability used to generate exposure; normally unknown in observational data.",
        },
        {
            "column": "true_cate",
            "role": "known simulation diagnostic",
            "description": "The true treatment effect for each row; normally unobserved in real data.",
        },
        {
            "column": "segment_label",
            "role": "display label",
            "description": "Human-readable version of the power segment indicator.",
        },
    ]
)
field_guide.to_csv(TABLE_DIR / "07_field_guide.csv", index=False)
display(field_guide)
```
| column | role | description |
| --- | --- | --- |
| recommendation_exposure | treatment | Binary indicator for whether the unit received the recommendation exposure. |
| weekly_value | outcome | Continuous post-treatment value outcome. |
| baseline_intent | confounder and effect modifier | Pre-treatment intent score that affects exposure, outcome, and treatment-effect size. |
| power_segment | confounder and effect modifier | Binary pre-treatment segment indicator that affects exposure, outcome, and treatment-effect size. |
| seasonality_score | confounder | Pre-treatment timing score that affects exposure and outcome but not the treatment effect directly. |
| prior_value | confounder | Pre-treatment value summary that affects exposure and outcome. |
| treatment_probability | known simulation diagnostic | The true treatment probability used to generate exposure; normally unknown in observational data. |
| true_cate | known simulation diagnostic | The true treatment effect for each row; normally unobserved in real data. |
| segment_label | display label | Human-readable version of the power segment indicator. |
Notice that baseline_intent and power_segment appear as both confounders and effect modifiers. They are needed for adjustment, and they are also useful for understanding where the effect is larger.
Basic Data Checks
Before estimating anything, we inspect shape, missingness, and core distribution summaries. This is not just housekeeping: CATE analysis splits the data into smaller groups, so data quality problems become more painful than they are for a single average effect.
The data are complete and the exposure rate is not extreme. The true CATE has meaningful spread, which means a single ATE will hide important variation.
Treatment Assignment Is Confounded
Treatment exposure is not random in this dataset. The next table compares pre-treatment characteristics between exposed and unexposed units. If the groups differ before treatment, an unadjusted outcome comparison is not a causal estimate.
The exposed group has higher baseline intent, higher prior value, and a different segment mix. It also has a higher average true CATE. That is exactly why adjustment and overlap checks are still necessary before modeling heterogeneity.
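One common way to quantify this kind of imbalance is the standardized mean difference. Here is a self-contained sketch on toy data (the column names mirror the notebook's, the data and the `standardized_mean_diff` helper are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
intent = rng.normal(0.0, 1.0, n)
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-intent)))  # exposure depends on intent
df = pd.DataFrame(
    {
        "recommendation_exposure": treated,
        "baseline_intent": intent,
        "prior_value": intent + rng.normal(0.0, 1.0, n),
    }
)

def standardized_mean_diff(frame, col, treat="recommendation_exposure"):
    """Difference in group means divided by the pooled standard deviation."""
    t = frame.loc[frame[treat] == 1, col]
    c = frame.loc[frame[treat] == 0, col]
    pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2.0)
    return (t.mean() - c.mean()) / pooled_sd

smd = {c: standardized_mean_diff(df, c) for c in ["baseline_intent", "prior_value"]}
print(smd)  # |SMD| above roughly 0.1 is a common imbalance flag
```

Because the standardized difference is unitless, it can be compared across covariates on very different scales, which makes it a convenient balance summary for tables like the one above.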
Visualize The True Heterogeneity
Because this is a simulation, we can inspect the true CATE directly. The plot below shows the true effect by baseline intent and segment. In real work, this plot is impossible; we would only have estimates.
The effect rises with baseline intent, and the power segment sits above the standard segment. This is the pattern our estimators should recover if the model class is appropriate and the adjustment variables are sufficient.
Ground-Truth CATE By Coarse Groups
Before fitting models, we summarize the true effect by segment and baseline-intent bucket. This gives us a simple target table to compare against later.
The grouped truth table is a useful reference. Later, estimated CATE tables should show the same broad ordering: power segment above standard, high intent above low intent.
Build The Causal Graph
The graph encodes the backdoor adjustment problem. Baseline intent, segment, seasonality, and prior value are all pre-treatment common causes of exposure and outcome. The graph does not need a special “effect modification arrow.” Effect modification is represented in the outcome model through treatment interactions.
The graph says that the average effect can be identified by adjusting for the observed common causes. The heterogeneity analysis will reuse that same identified causal contrast and then model how the contrast changes across pre-treatment features.
Create The DoWhy Model And Identify The Effect
This is the standard DoWhy workflow from earlier notebooks: create a CausalModel, then ask DoWhy to identify the causal effect. The estimand is still an ATE at this stage. CATE comes during estimation and reporting.
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
d/d[recommendation_exposure] E[weekly_value | power_segment, baseline_intent, seasonality_score, prior_value]
Estimand assumption 1, Unconfoundedness: If U→{recommendation_exposure} and U→weekly_value then P(weekly_value|recommendation_exposure,power_segment,baseline_intent,seasonality_score,prior_value,U) = P(weekly_value|recommendation_exposure,power_segment,baseline_intent,seasonality_score,prior_value)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
d/d[recommendation_exposure] E[weekly_value | power_segment, baseline_intent, seasonality_score, prior_value]
Estimand assumption 1, Unconfoundedness: If U→{recommendation_exposure} and U→weekly_value then P(weekly_value|recommendation_exposure,power_segment,baseline_intent,seasonality_score,prior_value,U) = P(weekly_value|recommendation_exposure,power_segment,baseline_intent,seasonality_score,prior_value)
DoWhy identifies a backdoor estimand using the observed common causes. This is the causal foundation for both the average effect and the heterogeneous effect summaries below.
Estimate The Average Treatment Effect First
Before modeling heterogeneity, estimate the ATE. This gives us a baseline causal summary and a check against the known simulation truth.
The adjusted ATE should be much closer to the true ATE than the naive difference. That gives us confidence that the graph and adjustment variables are doing useful work before we ask a harder heterogeneity question.
Add Effect Modifiers To The DoWhy Linear Estimator
DoWhy’s linear regression estimator can include treatment interactions with effect modifiers. Here we mark baseline_intent and power_segment as effect modifiers. They are also adjustment variables, which is allowed: they help block confounding and help explain variation in effect size.
We turn off automatic conditional estimates in this cell because we will compute subgroup effects explicitly. That keeps the grouping logic transparent and avoids hiding important choices about buckets.
The realized estimator expression includes treatment interactions. Those interactions are what allow the estimated treatment effect to vary across baseline intent and segment.
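To see what such an interaction model does mechanically, here is a plain least-squares sketch on synthetic data (this illustrates the idea, it is not DoWhy's internal code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
intent = rng.normal(0.0, 1.0, n)
t = rng.binomial(1, 0.5, n)
# True effect in this toy setup: 1.0 + 0.5 * intent; intent also shifts the outcome directly.
y = 2.0 + 1.5 * intent + (1.0 + 0.5 * intent) * t + rng.normal(0.0, 1.0, n)

# Design matrix: intercept, modifier, treatment, treatment-by-modifier interaction.
X = np.column_stack([np.ones(n), intent, t, t * intent])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The implied CATE at modifier value x is beta[2] + beta[3] * x.
print(f"CATE(intent=0) ~ {beta[2]:.2f}, slope in intent ~ {beta[3]:.2f}")
```

The coefficient on the interaction term is exactly what lets the estimated effect vary with the modifier; without it, the model can only report one effect for everyone.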
Helper For Subgroup Effects
The fitted DoWhy estimator can compute the treatment contrast on any subset of rows by comparing predicted outcomes under treatment and control. This helper uses the already-fitted estimator and returns the average contrast for a provided slice.
```python
def estimate_effect_for_slice(estimate, data_slice):
    """Use a fitted DoWhy estimator to estimate the average treatment effect inside a slice."""
    if len(data_slice) == 0:
        return np.nan
    return estimate.estimator.estimate_effect(
        data_slice[dowhy_columns],
        treatment_value=1,
        control_value=0,
        need_conditional_estimates=False,
    ).value


# Quick smoke test on the full data: this should match the average effect from the fitted linear CATE model.
full_slice_effect = estimate_effect_for_slice(linear_cate_estimate, cate_df)
print(f"Full-sample effect from slice helper: {full_slice_effect:.3f}")
print(f"Stored linear CATE model average effect: {linear_cate_estimate.value:.3f}")
```
Full-sample effect from slice helper: 0.909
Stored linear CATE model average effect: 0.909
The helper returns the same full-sample effect as the stored estimate. Now we can use it for interpretable subgroup summaries.
Segment-Level CATE
The first heterogeneity view compares the estimated effect for standard and power users. This is intentionally simple: before showing individual-level scores, it is better to verify that coarse groups move in the expected direction.
The estimated segment effects should preserve the main pattern: power users have a larger treatment effect than standard users. The table also shows exposure rates, because subgroup effects are less credible when one subgroup has little treated or untreated support.
Intent-Bucket CATE
Now we estimate CATE by baseline-intent quartile. Bucketed effects are less granular than individual scores, but they are easier to explain and audit.
The estimated CATE should rise from low intent to high intent. This is the first check that the effect-modifier model is learning the intended slope.
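The bucketing pattern itself can be sketched with pd.qcut on toy data (illustrative only; treatment is randomized here so a within-bucket difference in means is a valid bucket-level effect estimate, which is not true in the confounded notebook data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 8_000
intent = rng.normal(0.0, 1.0, n)
t = rng.binomial(1, 0.5, n)                     # randomized in this toy example
y = intent + (1.0 + 0.5 * intent) * t + rng.normal(0.0, 1.0, n)

df = pd.DataFrame({"intent": intent, "t": t, "y": y})
df["intent_bucket"] = pd.qcut(df["intent"], q=4, labels=["q1", "q2", "q3", "q4"])

# Mean outcome by bucket and arm, then the treated-minus-control contrast per bucket.
means = df.groupby(["intent_bucket", "t"], observed=True)["y"].mean().unstack("t")
means["cate"] = means[1] - means[0]
print(means["cate"])
```

In the notebook's confounded setting, the same bucketing is applied, but the within-bucket contrast comes from the fitted adjusted estimator rather than a raw difference in means.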
Two-Way CATE Table
Many heterogeneity stories are two-dimensional. Here we cross segment with intent bucket. This table is especially useful for communication because it shows both sources of variation at once.
The two-way view should show a ladder: higher intent increases the effect within each segment, and the power segment is higher within each intent bucket.
Plot Linear CATE By Group
A grouped bar chart makes the two-way CATE pattern easier to scan. We plot estimated and true CATE side by side so the model’s misses are visible.
The linear model captures the broad monotonic pattern. It may still miss nonlinear details, especially at high baseline intent where the true effect contains an extra positive bend.
DoWhy’s Built-In Conditional Estimates For A Numeric Modifier
DoWhy can also compute conditional estimates automatically for numeric effect modifiers by splitting them into quantile bins. Here we ask for conditional estimates over baseline_intent only. We keep this example one-dimensional so the output is easy to read.
The automatic conditional estimates are useful for quick exploration. For reporting, it is often better to create named buckets yourself so the table labels are stable and easier to explain.
Individual Effect Scores From The Linear CATE Model
A CATE model can score each row by comparing predicted outcomes under treatment and control. These row-level scores are not observed facts; they are model-based estimates. In this simulation, we can compare them to true_cate to see how well the model is doing.
The correlation and RMSE summarize how well the linear CATE score tracks the true treatment effect. A good average effect can coexist with weaker individual ranking, so both checks are useful.
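Both diagnostics are short to compute; a sketch with a hypothetical score column (made-up data, not the notebook's model output):

```python
import numpy as np

rng = np.random.default_rng(4)
true_cate = rng.normal(1.0, 0.5, 1_000)
est_cate = true_cate + rng.normal(0.0, 0.3, 1_000)   # hypothetical model score

# Correlation measures ranking/association; RMSE measures absolute scale error.
corr = np.corrcoef(est_cate, true_cate)[0, 1]
rmse = np.sqrt(np.mean((est_cate - true_cate) ** 2))
print(f"corr = {corr:.2f}, RMSE = {rmse:.2f}")
```

The two metrics can disagree: a model whose scores are shifted by a constant has perfect correlation but poor RMSE, which is why both are worth reporting.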
Linear CATE Calibration Plot
This scatterplot compares estimated and true row-level effects. The diagonal line marks perfect agreement. Because there are many points, we sample a subset for readability.
The linear model should rank effects reasonably well, but the shape may be too simple for the nonlinear high-intent region. That motivates trying a more flexible CATE model.
Flexible CATE With DoWhy And EconML
DoWhy can call EconML estimators through method names such as backdoor.econml.dml.CausalForestDML. This keeps the causal graph and identification step in DoWhy while using a flexible estimator for heterogeneous effects.
A causal forest is useful when treatment effects vary nonlinearly or through feature interactions. It is also easier to overfit than a simple linear interaction model, so we will evaluate it rather than assume it is better.
The forest average effect should be close to the true ATE. The main reason to fit the forest is not the average, though; it is the possibility of better effect ranking when the true CATE is nonlinear.
Score Rows With The Causal Forest
The fitted EconML estimator exposes an effect method. We pass the same effect-modifier features used during fitting and store the resulting CATE score.
The forest usually improves row-level ranking and nonlinear fit, while the linear model remains easier to explain. This is a familiar tradeoff: flexibility can improve CATE scoring, but it also raises the bar for diagnostics.
Compare Linear And Forest CATE Scores
The next plot compares both model scores against the true CATE. It is a compact way to see whether the flexible model is adding useful signal.
The diagonal reference line makes underestimation and overestimation visible. The forest can bend with the nonlinear region, while the linear model is constrained to a simpler shape.
CATE Ranking And Targeting Diagnostics
A common use of CATE models is prioritization: focus treatment where the expected causal gain is largest. In this simulation, we can check whether higher predicted-effect groups truly have higher effects.
We will create quintiles based on each model’s estimated CATE score and compare the average true CATE inside each quintile.
A useful CATE model should produce a rising mean_true_cate from the lowest to highest score quintile. This is a targeting diagnostic, not a proof of validity, but it is very helpful when ground truth or experimental validation is available.
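The quintile diagnostic itself is straightforward; here is a self-contained sketch with hypothetical scores (illustrative names and data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 5_000
true_cate = rng.normal(1.0, 0.6, n)
score = true_cate + rng.normal(0.0, 0.4, n)          # hypothetical model score

df = pd.DataFrame({"true_cate": true_cate, "score": score})
df["score_quintile"] = pd.qcut(df["score"], q=5, labels=[1, 2, 3, 4, 5])

# Mean true effect inside each predicted-score quintile.
summary = df.groupby("score_quintile", observed=True)["true_cate"].mean()
print(summary)   # should rise from quintile 1 to 5 for a useful model
```

In real data, `true_cate` is unavailable, so the same diagnostic is usually run against effects estimated from a holdout experiment.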
Plot CATE Ranking Quality
The line chart below shows whether higher predicted CATE buckets correspond to higher true effects. This is often easier to communicate than a dense individual-level scatterplot.
The best pattern is a clear upward slope. If the line is flat, the model may estimate the average effect well but fail to rank units by treatment-effect size.
Overlap By Predicted CATE Group
CATE models can create confident-looking scores in regions with poor treatment overlap. The next diagnostic estimates a simple propensity model, then summarizes propensity and exposure rates by predicted CATE quintile.
This does not replace the causal graph, but it helps reveal whether the highest-score groups have both treated and untreated examples.
The table checks whether the highest predicted-effect group still has real treatment variation. If a group is almost entirely treated or untreated, its CATE estimate leans heavily on extrapolation.
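The overlap diagnostic can be sketched as follows (toy data with a hypothetical score; the notebook's propensity model and column names may differ). The toy assignment is deliberately near-deterministic at the extremes so the problem is visible:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 5_000
intent = rng.normal(0.0, 1.0, n)
p_true = 1.0 / (1.0 + np.exp(-2.0 * intent))        # near-deterministic at the extremes
treated = rng.binomial(1, p_true)
cate_score = intent + rng.normal(0.0, 0.3, n)       # hypothetical score, correlated with intent

# Simple propensity model on the observed covariate.
prop_model = LogisticRegression().fit(intent.reshape(-1, 1), treated)
propensity = prop_model.predict_proba(intent.reshape(-1, 1))[:, 1]

df = pd.DataFrame({"treated": treated, "propensity": propensity, "cate_score": cate_score})
df["cate_quintile"] = pd.qcut(df["cate_score"], q=5, labels=[1, 2, 3, 4, 5])
overlap = df.groupby("cate_quintile", observed=True).agg(
    mean_propensity=("propensity", "mean"),
    treated_share=("treated", "mean"),
)
print(overlap)   # the top quintile is mostly treated, i.e. weak overlap there
```

When the highest-score quintile is almost entirely treated, its estimated effect rests on extrapolated control outcomes, which is exactly the situation this table is meant to flag.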
Plot Propensity Distributions Across Forest CATE Groups
The boxplot below shows estimated propensity by CATE quintile. It helps catch a common problem: the model’s “best treatment group” may also be a group where treatment assignment was nearly deterministic.
The dashed lines mark very rough danger zones. In real work, poor overlap should trigger trimming, redesigned analysis, or more cautious claims about the affected region.
Model Comparison Summary
This final table compares the average effect, row-level score quality, and targeting quality for the linear and forest CATE approaches. The targeting lift here is computed as the true CATE in the highest score quintile minus the population true ATE.
The better CATE model is not always the model with the best average effect. For heterogeneity work, ranking and subgroup calibration matter because the model may be used to decide where treatment is most valuable.
Practical CATE Checklist
Use this checklist when moving from tutorial data to real data:
Make the causal graph and adjustment set explicit before modeling heterogeneity.
Use only pre-treatment variables as effect modifiers.
Start with interpretable subgroup CATE tables before showing individual scores.
Check treatment overlap inside the subgroups that drive the story.
Compare simple and flexible CATE models; do not assume complexity wins.
Validate CATE ranking with experiments, holdout randomized data, or domain-specific stress tests whenever possible.
Report uncertainty and limitations, especially for small or poorly supported subgroups.
Practice Prompts
Try these extensions after running the notebook:
Remove baseline_intent from the adjustment set but keep it as an effect modifier. How do the ATE and CATE summaries change?
Increase the nonlinear term in true_cate. Does the causal forest pull farther ahead of the linear model?
Make treatment assignment nearly deterministic for high-intent users. What happens to the overlap diagnostics?
Replace quintiles with deciles. Which CATE summaries become noisier?
Create a final one-page stakeholder summary that reports the ATE, two subgroup effects, and the main overlap limitation.
What Comes Next
The next tutorial focuses on refuters, placebos, negative controls, and sensitivity checks. That is the natural follow-up: once we estimate an effect or a pattern of heterogeneous effects, we need to ask how fragile the result is to alternative explanations.