EconML Tutorial 09: Inference, Intervals, And Uncertainty
This notebook is about the part of causal machine learning that often gets skipped when people are excited about heterogeneous treatment effects: how sure are we?
A CATE model can produce a treatment-effect estimate for every user, customer, product, region, or session. That is powerful, but it also creates a dangerous temptation: sort everyone by the point estimate and treat the top group. In real decision systems, point estimates are noisy. Some units look high-benefit because the model has strong signal. Others look high-benefit because they live in sparse parts of the data where the estimator is uncertain.
EconML includes estimators that can return treatment-effect intervals. Those intervals do not prove the causal design is correct, but they help answer practical questions:
Is the average treatment effect clearly different from zero?
Are the highest predicted CATEs also precise enough to act on?
Which segments have wide intervals because overlap is weak or data are sparse?
How much does a policy change when we rank by a lower confidence bound instead of a raw point estimate?
What should we report so a reader can distinguish signal from model noise?
We use synthetic data with known ground-truth treatment effects. In real work the true CATE is not observed, but this teaching setup lets us check whether uncertainty diagnostics behave sensibly.
Learning Goals
By the end of this notebook, you should be able to:
Fit EconML estimators that return treatment-effect intervals.
Distinguish ATE uncertainty from CATE uncertainty.
Diagnose where estimates are least precise.
Compare point-estimate targeting with lower-confidence-bound targeting.
Write a transparent uncertainty report.
Tutorial Flow
We will build a confounded observational teaching dataset, fit CausalForestDML and LinearDML, extract CATE intervals, inspect coverage and interval width against known synthetic truth, connect wide intervals to weak overlap, and convert the estimates into uncertainty-aware treatment rules.
Setup
This cell imports the packages used in the notebook, creates output folders, fixes plotting defaults, and suppresses a few harmless library warnings that otherwise clutter tutorial output. The code is intentionally visible because reproducibility matters: readers should see exactly how the environment is configured before any causal model is fit.
from pathlib import Path
import os
import warnings

# Suppress an optional widget warning that can appear while importing EconML in headless notebook runs.
warnings.filterwarnings("ignore", message="IProgress not found.*")

# Keep Matplotlib cache files in a writable location during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")

import econml
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split

from econml.dml import CausalForestDML, LinearDML

# EconML internally converts between pandas and NumPy in a few places. These warnings do not change the results.
warnings.filterwarnings("ignore", message="X does not have valid feature names.*", category=UserWarning)
warnings.filterwarnings("ignore", message="Not all column names are strings.*", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.titleweight"] = "bold"
plt.rcParams["axes.labelsize"] = 11


def find_project_root(start=None):
    """Find the repository root from either the repo or a nested notebook folder."""
    start = Path.cwd() if start is None else Path(start)
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").exists() and (candidate / "notebooks").exists():
            return candidate
    return Path.cwd()


PROJECT_ROOT = find_project_root()
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "econml"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(202609)

print(f"Project root: {PROJECT_ROOT}")
print(f"EconML version: {econml.__version__}")
print(f"Figures will be saved to: {FIGURE_DIR.relative_to(PROJECT_ROOT)}")
print(f"Tables will be saved to: {TABLE_DIR.relative_to(PROJECT_ROOT)}")
Project root: /home/apex/Documents/ranking_sys
EconML version: 0.16.0
Figures will be saved to: notebooks/tutorials/econml/outputs/figures
Tables will be saved to: notebooks/tutorials/econml/outputs/tables
The environment is ready. Every artifact from this notebook will use the 09_ prefix so the outputs remain easy to find alongside the rest of the tutorial series.
What We Mean By Uncertainty
Before fitting models, it helps to separate different uncertainty objects. A confidence interval around an ATE answers a different question from a confidence interval around a CATE for one row. A wide interval can come from low sample size, weak treatment overlap, noisy outcomes, high model variance, or all of these at once.
uncertainty_map = pd.DataFrame(
    [
        {
            "object": "ATE interval",
            "question_answered": "How uncertain is the average treatment effect over a target population?",
            "common_use": "Executive summary, experiment-style reporting, broad go/no-go decisions.",
            "main_risk": "Can hide large positive and negative segment effects that cancel out.",
        },
        {
            "object": "CATE interval",
            "question_answered": "How uncertain is the treatment effect for a specific covariate profile?",
            "common_use": "Personalization, segment diagnostics, treatment ranking.",
            "main_risk": "Many row-level intervals are noisy and should not be overread individually.",
        },
        {
            "object": "Interval width",
            "question_answered": "Where is the estimator less precise?",
            "common_use": "Overlap diagnostics, data-quality checks, conservative targeting.",
            "main_risk": "A narrow interval is not evidence that the causal assumptions are true.",
        },
        {
            "object": "Lower confidence bound",
            "question_answered": "Which units still look beneficial after accounting for uncertainty?",
            "common_use": "Risk-aware targeting when treatment has a cost or downside.",
            "main_risk": "Can be overly conservative if intervals are poorly calibrated or data are small.",
        },
        {
            "object": "Bootstrap policy interval",
            "question_answered": "How stable is the estimated value of a targeting rule under row resampling?",
            "common_use": "Policy comparison, robustness checks, reporting uncertainty around uplift targeting.",
            "main_risk": "Row bootstrap does not fully capture model-fitting uncertainty unless models are refit.",
        },
    ]
)
uncertainty_map.to_csv(TABLE_DIR / "09_uncertainty_map.csv", index=False)
display(uncertainty_map)
(Output: the five-row uncertainty_map table defined in the cell above, one row per uncertainty object.)
The table gives us a vocabulary for the rest of the notebook. The important habit is to connect each interval to a decision: average launch decisions need ATE uncertainty, while targeted rollout decisions need CATE and policy-value uncertainty.
Teaching Data Design
We will generate a synthetic observational dataset. The treatment is not randomly assigned: high-need users and low-friction users are more likely to receive the treatment. The outcome is continuous, and the true treatment effect varies across users.
The data also include a support_risk feature. This marks rows with weaker treatment overlap and noisier outcomes, which lets us teach why interval width often has a data-support explanation.
The complete teaching table mixes observed covariates with known truth columns. In real causal work we would not have true_propensity, true_cate, or noise_scale; they are included here only so we can evaluate whether the uncertainty diagnostics are behaving in a reasonable way.
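The full generator is not reproduced here, but the sketch below shows its structure under stated assumptions: the frame name df, all coefficients, and the reduction to three covariates are illustrative stand-ins, while the confounded assignment through expit, the heterogeneous true_cate, and the support_risk noise inflation follow the design described above.

# Abbreviated, illustrative sketch of the data-generating process.
# Coefficients and the df name are assumptions; the real generator also
# builds the remaining covariates listed in the field dictionary below.
n = 3_600
baseline_need = rng.normal(0.0, 1.0, n)
friction_score = rng.normal(0.0, 1.0, n)
support_risk = rng.binomial(1, 0.2, n)

# Confounded assignment: high-need, low-friction users are treated more often.
true_propensity = expit(0.8 * baseline_need - 0.6 * friction_score - 0.7 * support_risk)
treatment = rng.binomial(1, true_propensity)

# Heterogeneous true effect, and noisier outcomes where support is weak.
true_cate = 0.3 + 0.4 * baseline_need - 0.3 * friction_score
noise_scale = np.where(support_risk == 1, 1.5, 0.8)
outcome = (
    baseline_need
    - 0.5 * friction_score
    + treatment * true_cate
    + rng.normal(0.0, noise_scale)
)

df = pd.DataFrame(
    {
        "baseline_need": baseline_need,
        "friction_score": friction_score,
        "support_risk": support_risk,
        "treatment": treatment,
        "outcome": outcome,
        "true_propensity": true_propensity,
        "true_cate": true_cate,
        "noise_scale": noise_scale,
    }
)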
Field Dictionary
A reader should never have to infer the meaning of a synthetic feature from code alone. This table describes each field and marks whether it is an observed covariate, treatment, outcome, or teaching-only truth variable.
field_dictionary = pd.DataFrame(
    [
        ("baseline_need", "Observed covariate", "Latent demand or need level before treatment."),
        ("prior_engagement", "Observed covariate", "Historical engagement before treatment assignment."),
        ("friction_score", "Observed covariate", "Higher values mean more friction in the user experience."),
        ("content_affinity", "Observed covariate", "Match between the user and the content or offer."),
        ("price_sensitivity", "Observed covariate", "Higher values mean the user is more sensitive to cost or effort."),
        ("account_tenure", "Observed covariate", "Age of the account in weeks."),
        ("region_risk", "Observed covariate", "Binary marker for regions with lower baseline outcomes."),
        ("seasonality_index", "Observed covariate", "Time or market condition signal observed before treatment."),
        ("device_stability", "Observed covariate", "Technical stability proxy; higher values are better."),
        ("high_need_segment", "Observed covariate", "Binary segment derived from baseline need."),
        ("support_risk", "Observed covariate", "Marker for weaker overlap and noisier outcomes."),
        ("treatment", "Treatment", "Whether the unit received the intervention."),
        ("outcome", "Outcome", "Observed continuous outcome after treatment assignment."),
        ("true_propensity", "Teaching-only truth", "Known probability of treatment in the synthetic assignment process."),
        ("true_cate", "Teaching-only truth", "Known conditional treatment effect for each row."),
        ("noise_scale", "Teaching-only truth", "Known residual noise level used when generating the outcome."),
    ],
    columns=["field", "role", "description"],
)
field_dictionary.to_csv(TABLE_DIR / "09_field_dictionary.csv", index=False)
display(field_dictionary)
(Output: the sixteen-row field_dictionary table defined in the cell above, one row per field.)
The field dictionary makes the later diagnostics easier to read. Notice that support_risk is an observed covariate, not hidden truth; a real analyst could use a similar support signal to decide where more data are needed before fine-grained targeting.
Basic Shape And Outcome Summary
This cell checks the sample size, treatment rate, outcome distribution, true effect distribution, and overlap range. These basic summaries are not formal causal diagnostics, but they catch many problems before we fit a model.
The true average effect is positive, but the CATE standard deviation and positive-effect share tell us there is meaningful heterogeneity. That is the setting where intervals matter: we are not just asking whether the average effect is positive; we are asking which rows are reliably positive.
True CATE Distribution
Because this is synthetic data, we can plot the true CATE distribution. This plot is a teaching anchor: every estimated interval later in the notebook is trying to represent uncertainty around values drawn from this distribution.
The distribution has both low-benefit and high-benefit regions. A point-estimate-only policy will tend to focus on the right tail, while an uncertainty-aware policy will ask whether that right tail is estimated precisely enough to trust.
Naive Treated-Control Difference
Before using EconML, we compute the raw treated-control difference. This is not a causal estimate because treatment assignment is confounded. The point is to show why uncertainty around a naive contrast is not enough: a precise biased estimate can still be wrong.
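A minimal sketch of that contrast, assuming the teaching frame is named df as in the generator sketch above:

# Naive treated-minus-control difference: descriptive, not causal.
naive_diff = (
    df.loc[df["treatment"] == 1, "outcome"].mean()
    - df.loc[df["treatment"] == 0, "outcome"].mean()
)
true_ate = df["true_cate"].mean()
print(f"Naive difference: {naive_diff:.4f} | True ATE: {true_ate:.4f}")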
The treated group differs from the control group before we adjust for covariates, so the raw difference mixes treatment effect, selection bias, and baseline outcome differences. EconML’s uncertainty intervals only become meaningful after we address this confounding structure.
Covariate Balance
This cell computes standardized mean differences between treated and control groups. A standardized mean difference larger than about 0.10 is a practical warning sign that treatment and control groups differ on that feature.
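The computation is small enough to show in full. This sketch uses the pooled-standard-deviation version of the standardized mean difference and assumes feature_cols lists the observed covariate names (it is defined alongside the full generator):

# Standardized mean difference with a pooled standard deviation denominator.
def standardized_mean_difference(frame, feature, treatment_col="treatment"):
    treated = frame.loc[frame[treatment_col] == 1, feature]
    control = frame.loc[frame[treatment_col] == 0, feature]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd

balance_table = pd.DataFrame(
    {
        "feature": feature_cols,
        "smd": [standardized_mean_difference(df, col) for col in feature_cols],
    }
).sort_values("smd", key=np.abs, ascending=False)
display(balance_table)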
Several covariates are imbalanced, especially features that also affect the outcome or the true CATE. This reinforces why causal modeling has to adjust for observed confounding before we attach meaning to interval estimates.
Balance Plot
The table above is useful, but a plot makes the severity and direction of imbalance easier to scan. This plot is also a good report artifact because it explains why naive comparisons are not credible.
The dashed lines mark a common practical threshold for imbalance. The visible imbalance means our uncertainty discussion should focus on adjusted estimators rather than raw group differences.
Propensity Overlap
Confidence intervals become less trustworthy when treated and control rows do not overlap well. Here we use the known synthetic propensity to show the assignment support before fitting any estimated model.
The two distributions overlap, but they are not identical. The upper and lower propensity regions will tend to have wider CATE intervals because the estimator has fewer comparable treated-control contrasts there.
Train-Test Split
We split the data before fitting models so that CATE recovery, interval width, and policy diagnostics are evaluated on held-out rows. This mirrors a practical workflow where we train a model, then examine how its estimates behave on data not used for fitting.
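A sketch of the split. The test fraction, seed, and stratification shown here are assumptions chosen to be consistent with the row counts reported later (2,340 train rows and 1,260 test rows from 3,600 total):

# Stratify on treatment so both splits keep a similar treatment rate.
train_df, test_df = train_test_split(
    df, test_size=0.35, random_state=202609, stratify=df["treatment"]
)
print(f"Train rows: {len(train_df)} | Test rows: {len(test_df)}")
print(f"Train treatment rate: {train_df['treatment'].mean():.3f}")
print(f"Test treatment rate: {test_df['treatment'].mean():.3f}")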
The train and test splits have similar treatment and support-risk rates. That matters because large split differences would make later model diagnostics harder to interpret.
Model Matrices
EconML separates the treatment, outcome, and covariates. In this tutorial all observed pre-treatment covariates are used as effect modifiers X. We keep the setup simple so the uncertainty behavior is the main focus.
X_train = train_df[feature_cols].copy()
X_test = test_df[feature_cols].copy()
y_train = train_df["outcome"].to_numpy()
y_test = test_df["outcome"].to_numpy()
t_train = train_df["treatment"].to_numpy()
t_test = test_df["treatment"].to_numpy()
true_tau_train = train_df["true_cate"].to_numpy()
true_tau_test = test_df["true_cate"].to_numpy()

matrix_summary = pd.DataFrame(
    {
        "object": ["X_train", "X_test", "y_train", "t_train"],
        "shape_or_length": [X_train.shape, X_test.shape, len(y_train), len(t_train)],
        "description": [
            "Observed pre-treatment covariates used for heterogeneity.",
            "Held-out covariates used for evaluation and policy diagnostics.",
            "Observed training outcomes.",
            "Observed training treatment indicators.",
        ],
    }
)
matrix_summary.to_csv(TABLE_DIR / "09_model_matrix_summary.csv", index=False)
display(matrix_summary)
object    shape_or_length   description
X_train   (2340, 11)        Observed pre-treatment covariates used for heterogeneity.
X_test    (1260, 11)        Held-out covariates used for evaluation and policy diagnostics.
y_train   2340              Observed training outcomes.
t_train   2340              Observed training treatment indicators.
The matrices are now in the structure that EconML expects. The same feature_cols list will be reused in the estimators and later diagnostic tables so the analysis stays consistent.
Nuisance Diagnostics
DML estimators use nuisance models for the outcome and treatment assignment processes. If these models are weak, treatment-effect estimates and intervals can degrade. This cell trains simple diagnostic nuisance models on the training split and evaluates them on the test split.
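A sketch of the diagnostic fits; the random-forest hyperparameters are illustrative:

# Outcome regressor: how predictable is the outcome from covariates alone?
outcome_model = RandomForestRegressor(n_estimators=300, min_samples_leaf=10, random_state=0)
outcome_model.fit(X_train, y_train)
y_pred = outcome_model.predict(X_test)

# Propensity classifier: how predictable is treatment assignment?
propensity_model = RandomForestClassifier(n_estimators=300, min_samples_leaf=10, random_state=0)
propensity_model.fit(X_train, t_train)
p_hat = propensity_model.predict_proba(X_test)[:, 1]

print(f"Outcome RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"Outcome MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"Propensity AUC: {roc_auc_score(t_test, p_hat):.3f}")
print(f"Propensity log loss: {log_loss(t_test, p_hat):.3f}")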
The propensity model has predictive power, which is expected in an observational setup. That reminds us treatment is not random, and it also helps explain why interval width will vary across the covariate space.
Estimated Propensity Buckets
This cell bins held-out rows by estimated propensity. The purpose is to create a support diagnostic we can join to CATE intervals later. Rows near 0 or 1 usually have less counterfactual support than rows near 0.5.
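A sketch of the bucketing, reusing the estimated propensities p_hat from the diagnostic model above; the bucket edges are illustrative:

# Overlap score peaks at 0.5 and shrinks toward the propensity extremes.
support_df = test_df.copy()
support_df["p_hat"] = p_hat
support_df["overlap_score"] = np.minimum(support_df["p_hat"], 1 - support_df["p_hat"])
support_df["propensity_bucket"] = pd.cut(
    support_df["p_hat"], bins=[0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], include_lowest=True
)
bucket_table = (
    support_df.groupby("propensity_bucket", observed=True)
    .agg(
        rows=("p_hat", "size"),
        treated_rate=("treatment", "mean"),
        mean_overlap=("overlap_score", "mean"),
        support_risk_rate=("support_risk", "mean"),
    )
    .reset_index()
)
display(bucket_table)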
The bucket table turns abstract overlap into something visible. Later, if interval widths are largest in the edge buckets, we will have a concrete explanation: those rows have less balanced treated-control support.
Fit A Causal Forest With Intervals
CausalForestDML is a natural estimator for this notebook because it can estimate nonlinear CATEs and return uncertainty intervals. The forest uses machine-learning nuisance models and honest splitting internally, then estimates heterogeneous effects from the residualized treatment and outcome signal.
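A sketch of a fit that produces output like the numbers below; the nuisance models and forest hyperparameters shown here are plausible choices, not the exact configuration used:

cf_model = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=300, min_samples_leaf=10, random_state=0),
    model_t=RandomForestClassifier(n_estimators=300, min_samples_leaf=10, random_state=0),
    discrete_treatment=True,
    n_estimators=1000,
    min_samples_leaf=10,
    random_state=202609,
)
cf_model.fit(y_train, t_train, X=X_train)

# Row-level point estimates and 95% intervals on the held-out covariates.
forest_cate = cf_model.effect(X_test)
forest_lower, forest_upper = cf_model.effect_interval(X_test, alpha=0.05)

# Average-effect estimate and interval over the held-out population.
ate_lower, ate_upper = cf_model.ate_interval(X_test, alpha=0.05)
print(f"Causal forest held-out ATE estimate: {cf_model.ate(X_test):.4f}")
print(f"95% ATE interval: [{ate_lower:.4f}, {ate_upper:.4f}]")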
Causal forest held-out ATE estimate: 0.3048
95% ATE interval: [0.0691, 0.5406]
The forest now gives us both point estimates and intervals. The ATE interval summarizes uncertainty for the average held-out effect, while the row-level interval arrays will let us inspect uncertainty across the covariate space.
Fit An Interpretable Linear DML Baseline
A causal forest is flexible, but it is useful to compare it with a simpler model. LinearDML estimates a linear CATE function and can return statsmodels-style inference intervals. If the linear model performs much worse on a nonlinear data-generating process, that helps explain why model choice affects both point estimates and uncertainty.
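A matching sketch for the linear baseline; with the default inference setting, LinearDML reports statsmodels-style analytic intervals for its final-stage effects:

lin_model = LinearDML(
    model_y=RandomForestRegressor(n_estimators=300, min_samples_leaf=10, random_state=0),
    model_t=RandomForestClassifier(n_estimators=300, min_samples_leaf=10, random_state=0),
    discrete_treatment=True,
    random_state=202609,
)
lin_model.fit(y_train, t_train, X=X_train)

linear_cate = lin_model.effect(X_test)
linear_lower, linear_upper = lin_model.effect_interval(X_test, alpha=0.05)

lin_ate_lower, lin_ate_upper = lin_model.ate_interval(X_test, alpha=0.05)
print(f"LinearDML held-out ATE estimate: {lin_model.ate(X_test):.4f}")
print(f"95% ATE interval: [{lin_ate_lower:.4f}, {lin_ate_upper:.4f}]")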
LinearDML held-out ATE estimate: 0.2931
95% ATE interval: [0.2160, 0.3703]
The linear baseline gives a second view of uncertainty. It is more restrictive than the forest, so we should not expect it to recover all nonlinear heterogeneity, but it is valuable as a transparent comparison point.
Compare Model Recovery And Interval Coverage
Because we know the true CATE, we can compute teaching diagnostics that would not be available in production: CATE RMSE, correlation with truth, and empirical interval coverage. These checks help students understand what the intervals are doing.
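A sketch of the diagnostics, reusing the point estimates and interval arrays from the two fits sketched above:

# Teaching-only checks: these require the known true CATE column.
def interval_coverage(lower, upper, truth):
    """Share of rows whose interval contains the true effect."""
    return float(np.mean((lower <= truth) & (truth <= upper)))

recovery_summary = pd.DataFrame(
    [
        {
            "model": "CausalForestDML",
            "cate_rmse": float(np.sqrt(mean_squared_error(true_tau_test, forest_cate))),
            "corr_with_truth": float(np.corrcoef(true_tau_test, forest_cate)[0, 1]),
            "coverage_95": interval_coverage(forest_lower, forest_upper, true_tau_test),
            "mean_interval_width": float(np.mean(forest_upper - forest_lower)),
        },
        {
            "model": "LinearDML",
            "cate_rmse": float(np.sqrt(mean_squared_error(true_tau_test, linear_cate))),
            "corr_with_truth": float(np.corrcoef(true_tau_test, linear_cate)[0, 1]),
            "coverage_95": interval_coverage(linear_lower, linear_upper, true_tau_test),
            "mean_interval_width": float(np.mean(linear_upper - linear_lower)),
        },
    ]
)
display(recovery_summary)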
The summary separates three ideas: whether the average-effect interval covers the true ATE, how well the model recovers row-level heterogeneity, and how wide the row-level intervals are. Strong average inference does not automatically mean every individual CATE is precise.
CATE Recovery Scatter With Interval Width
This plot compares estimated and true CATE for the causal forest. The color represents interval width, so we can see whether the largest errors tend to occur in regions where the model is less certain.
The diagonal line marks perfect recovery. Points far from the diagonal are estimation errors; the color scale helps us see whether those errors are concentrated in high-uncertainty regions.
Interval Calibration By Estimated CATE Decile
A useful diagnostic is to group rows by estimated CATE and compare the average estimate, average true CATE, and average interval. In real data, we would not have the true CATE column, but this teaching view makes the logic of calibration easier to see.
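A sketch of the decile table, again leaning on the teaching-only true_cate column:

calib_df = pd.DataFrame(
    {
        "forest_cate": forest_cate,
        "true_cate": true_tau_test,
        "lower": forest_lower,
        "upper": forest_upper,
    }
)
calib_df["decile"] = pd.qcut(calib_df["forest_cate"], 10, labels=False, duplicates="drop")
decile_table = (
    calib_df.groupby("decile")
    .agg(
        rows=("forest_cate", "size"),
        mean_estimated=("forest_cate", "mean"),
        mean_true=("true_cate", "mean"),
        mean_lower=("lower", "mean"),
        mean_upper=("upper", "mean"),
    )
    .reset_index()
)
display(decile_table)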
The decile table shows whether higher estimated CATE groups also have higher true CATE on average. The interval columns make the ranking less brittle by showing how much uncertainty surrounds each group.
Calibration Plot With Decile-Level Intervals
The table is precise, but the plot makes the ranking pattern clearer. Each dot is a decile of estimated forest CATE. The vertical bars are the average lower and upper interval bounds for that decile, not a new interval for the decile mean.
The upward pattern tells us the forest ranking contains useful signal. The error bars remind us that the distance between neighboring deciles may not be meaningful when intervals overlap heavily.
Interval Width By Propensity Support
Now we connect uncertainty back to overlap. If the model has weaker counterfactual support in low-overlap rows, the interval width should generally be larger there.
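A sketch of the join, reusing support_df from the propensity-bucket sketch; the rows align because support_df was copied from test_df in the same order used to score X_test:

support_df["interval_width"] = forest_upper - forest_lower
width_by_bucket = (
    support_df.groupby("propensity_bucket", observed=True)
    .agg(
        rows=("interval_width", "size"),
        mean_width=("interval_width", "mean"),
        mean_overlap=("overlap_score", "mean"),
        support_risk_rate=("support_risk", "mean"),
    )
    .reset_index()
)
display(width_by_bucket)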
This table translates overlap into interval behavior. Buckets with lower overlap scores or higher support-risk rates are the places where row-level CATE estimates should be treated with extra caution.
Interval Width Scatter
This plot shows each held-out row as a point. The x-axis is estimated overlap support, and the y-axis is the forest interval width. A downward pattern means intervals get narrower as treated-control support improves.
The plot links a statistical object to a data-support explanation. Wide intervals are not random noise in the report; they often point to places where the data provide weaker counterfactual comparisons.
Interval Width Drivers
A quick way to explain interval width is to model it as a function of the observed covariates. This is not a causal model; it is a diagnostic model that asks which features are associated with more uncertain treatment-effect estimates.
The driver table helps explain which features are associated with precision rather than effect size. If a feature drives interval width, that feature may deserve extra support checks before using CATE estimates for targeting.
Plot Interval Width Drivers
The feature-importance table is useful for exact values, while a bar chart gives a faster read of the main uncertainty drivers.
The highest-ranked features tell us where uncertainty concentrates. In a real analysis, this would motivate targeted diagnostics, more data collection, or a more conservative policy in those segments.
ATE Uncertainty Versus CATE Uncertainty
Average effects are often estimated more precisely than individual effects because averaging cancels some noise. This cell puts ATE intervals beside row-level CATE interval summaries so readers do not confuse the two.
The row-level intervals are much wider than the ATE interval. This is the normal tradeoff: personalization is more granular, so it usually carries more uncertainty than a single average-effect estimate.
Alpha Levels And Interval Width
A 95% interval is not the only possible choice. The alpha parameter equals one minus the confidence level, so smaller alpha values demand higher confidence and produce wider intervals. This cell computes interval width and teaching coverage for several alpha values.
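A sketch of the sweep; the alpha grid is illustrative:

alpha_rows = []
for alpha in [0.01, 0.05, 0.10, 0.20]:
    lb, ub = cf_model.effect_interval(X_test, alpha=alpha)
    alpha_rows.append(
        {
            "alpha": alpha,
            "confidence": 1 - alpha,
            "mean_width": float(np.mean(ub - lb)),
            "teaching_coverage": float(np.mean((lb <= true_tau_test) & (true_tau_test <= ub))),
            "share_lower_bound_positive": float(np.mean(lb > 0)),
        }
    )
alpha_table = pd.DataFrame(alpha_rows)
display(alpha_table)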
Higher confidence levels produce wider intervals and usually higher teaching coverage. The last column shows the decision cost of being more conservative: fewer rows have a lower bound above zero.
Plot Alpha Sensitivity
This figure shows the interval-width tradeoff across confidence levels. It is a compact way to explain why a policy can become more or less conservative depending on the interval standard used.
The plot makes the decision tradeoff visible. If we demand very high confidence, the targetable group shrinks because the lower bound must clear a stricter uncertainty bar.
Point-Estimate Targeting Versus Lower-Bound Targeting
Now we turn CATE estimates into policy rules. A point-estimate rule treats the highest estimated effects. A lower-bound rule treats rows whose conservative estimate is strongest. The comparison shows how uncertainty changes who gets selected.
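The sketch below builds the three policy columns that the bootstrap cell further down expects (point_rank_policy, lower_bound_rank_policy, positive_lower_bound_policy); the 20% budget matches the 252 selected rows reported there:

policy_df = test_df.copy()
policy_df["forest_cate"] = forest_cate
policy_df["cate_lower"] = forest_lower

budget = int(0.20 * len(policy_df))  # fixed treatment budget: 252 of 1,260 rows

point_rank = policy_df["forest_cate"].rank(ascending=False, method="first")
lower_rank = policy_df["cate_lower"].rank(ascending=False, method="first")

policy_df["point_rank_policy"] = (point_rank <= budget).astype(int)
policy_df["lower_bound_rank_policy"] = (lower_rank <= budget).astype(int)
policy_df["positive_lower_bound_policy"] = (policy_df["cate_lower"] > 0).astype(int)

# Teaching comparison: the mean true effect among selected rows under each rule.
for col in ["point_rank_policy", "lower_bound_rank_policy", "positive_lower_bound_policy"]:
    selected = policy_df[policy_df[col] == 1]
    print(f"{col}: {len(selected)} rows, mean true CATE {selected['true_cate'].mean():.4f}")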
The lower-bound policy may select a different group even when the treatment budget is the same. It favors rows with both high estimated benefit and enough precision to make the lower bound attractive.
Plot Targeting Rules
This plot compares policy rules across three quantities: true targeted gain, conservative estimated gain, and support-risk share. The support-risk panel is especially useful because it shows whether a policy is leaning into less certain regions.
The three-panel view shows the practical meaning of uncertainty-aware targeting. A rule can have slightly lower average point estimates but be more defensible if it avoids fragile high-uncertainty selections.
Bootstrap Policy-Value Intervals
The CATE intervals describe row-level treatment-effect uncertainty. Policy stakeholders often also ask: how stable is the value of the selected policy group? This bootstrap resamples held-out rows and estimates the average predicted effect among selected rows. It is a lightweight policy-value uncertainty check, not a full model-refit bootstrap.
def bootstrap_policy_mean(frame, policy_col, value_col, n_bootstrap=1_000):
    """Bootstrap the mean predicted treatment effect among rows selected by a policy."""
    selected = frame.loc[frame[policy_col] == 1, value_col].to_numpy()
    if len(selected) == 0:
        return {"mean": np.nan, "ci_lower": np.nan, "ci_upper": np.nan, "selected_rows": 0}
    bootstrap_means = []
    for _ in range(n_bootstrap):
        sample = rng.choice(selected, size=len(selected), replace=True)
        bootstrap_means.append(sample.mean())
    return {
        "mean": selected.mean(),
        "ci_lower": np.quantile(bootstrap_means, 0.025),
        "ci_upper": np.quantile(bootstrap_means, 0.975),
        "selected_rows": len(selected),
    }


bootstrap_rows = []
for policy_col, label in [
    ("point_rank_policy", "Top 20% by point estimate"),
    ("lower_bound_rank_policy", "Top 20% by lower bound"),
    ("positive_lower_bound_policy", "All rows with lower bound > 0"),
]:
    result = bootstrap_policy_mean(policy_df, policy_col, "forest_cate")
    result["policy"] = label
    bootstrap_rows.append(result)

bootstrap_policy_intervals = pd.DataFrame(bootstrap_rows)[
    ["policy", "selected_rows", "mean", "ci_lower", "ci_upper"]
]
bootstrap_policy_intervals.to_csv(TABLE_DIR / "09_bootstrap_policy_intervals.csv", index=False)
display(bootstrap_policy_intervals)
policy                         selected_rows   mean       ci_lower   ci_upper
Top 20% by point estimate      252             0.637133   0.627802   0.646134
Top 20% by lower bound         252             0.630662   0.620399   0.640281
All rows with lower bound > 0  792             0.444357   0.433538   0.457274
The bootstrap intervals summarize how stable the estimated policy value is over the held-out population. These intervals are usually narrower than individual CATE intervals because they average over selected rows.
Plot Bootstrap Policy Intervals
The table gives exact values; the plot makes policy comparison easier. Each point is the average estimated CATE among selected rows, and the horizontal bar is the bootstrap interval.
The policy-value intervals are a practical communication tool. They keep the report from implying false precision about a targeting rule’s average benefit.
Treatment Cost And Conservative Decisions
Many interventions have a cost: operational cost, user-experience cost, fairness cost, or opportunity cost. A treatment should clear that cost, not just be positive. This cell compares point-estimate and lower-bound rules under a simple cost threshold.
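A sketch with an illustrative per-unit cost in outcome units:

treatment_cost = 0.25  # assumed cost; in practice this comes from the business case

point_rule = policy_df["forest_cate"] > treatment_cost
lower_rule = policy_df["cate_lower"] > treatment_cost

cost_comparison = pd.DataFrame(
    [
        {
            "rule": "point estimate > cost",
            "treated_rows": int(point_rule.sum()),
            "share_true_below_cost": float((policy_df.loc[point_rule, "true_cate"] < treatment_cost).mean()),
        },
        {
            "rule": "lower bound > cost",
            "treated_rows": int(lower_rule.sum()),
            "share_true_below_cost": float((policy_df.loc[lower_rule, "true_cate"] < treatment_cost).mean()),
        },
    ]
)
display(cost_comparison)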
The lower-bound rule is stricter because it requires the conservative estimate to exceed cost. This often reduces treatment volume, but it can also reduce the share of selected rows whose true net effect is negative in the teaching data.
Lower-Bound Threshold Curve
A single threshold can feel arbitrary. This cell evaluates a range of lower-bound thresholds so we can see the tradeoff between treating more rows and demanding stronger conservative evidence.
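A sketch of the threshold sweep; the grid is illustrative:

threshold_rows = []
for threshold in np.linspace(0.0, 0.6, 13):
    selected = policy_df["cate_lower"] > threshold
    threshold_rows.append(
        {
            "lower_bound_threshold": round(float(threshold), 3),
            "treated_rows": int(selected.sum()),
            "mean_true_cate_selected": float(policy_df.loc[selected, "true_cate"].mean()) if selected.any() else np.nan,
            "support_risk_share": float(policy_df.loc[selected, "support_risk"].mean()) if selected.any() else np.nan,
        }
    )
threshold_curve = pd.DataFrame(threshold_rows)
display(threshold_curve)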
The threshold curve is a decision table. It shows how the treated population shrinks as the lower-bound standard rises, and how that affects expected gain and selected-segment composition.
Plot Threshold Tradeoffs
This figure places the main threshold tradeoffs in one view. The ideal threshold depends on the intervention’s cost, capacity, and tolerance for uncertain selections.
The threshold plot shows why uncertainty-aware targeting is a business decision as well as a modeling decision. More conservative thresholds select fewer rows and can avoid uncertain support-risk regions, but they may leave some positive true effects untreated.
Sample Size Sensitivity
Intervals should generally become narrower as the training sample grows, although the pattern will not be perfectly smooth. This cell refits smaller causal forests on nested training samples and evaluates each model on the same held-out set.
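A sketch of the refit loop. The sample-size grid, the smaller forest, and the hyperparameters are illustrative; every model is scored on the same held-out X_test:

# Nested subsamples of a shuffled training frame, evaluated on one test set.
shuffled_train = train_df.sample(frac=1.0, random_state=202609).reset_index(drop=True)
size_rows = []
for n_rows in [500, 1_000, 1_500, len(shuffled_train)]:
    sub = shuffled_train.iloc[:n_rows]
    model = CausalForestDML(
        model_y=RandomForestRegressor(n_estimators=200, min_samples_leaf=10, random_state=0),
        model_t=RandomForestClassifier(n_estimators=200, min_samples_leaf=10, random_state=0),
        discrete_treatment=True,
        n_estimators=500,
        random_state=202609,
    )
    model.fit(sub["outcome"].to_numpy(), sub["treatment"].to_numpy(), X=sub[feature_cols])
    lb, ub = model.effect_interval(X_test, alpha=0.05)
    size_rows.append(
        {
            "train_rows": n_rows,
            "mean_interval_width": float(np.mean(ub - lb)),
            "cate_rmse": float(np.sqrt(mean_squared_error(true_tau_test, model.effect(X_test)))),
        }
    )
size_table = pd.DataFrame(size_rows)
display(size_table)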
The sample-size diagnostic shows how much precision depends on training data volume. In real work, this can support a recommendation to gather more data before using CATE estimates for fine-grained targeting.
Plot Sample Size Sensitivity
This plot turns the sample-size table into a quick visual check. We expect interval width to fall as data increases, while recovery quality should generally improve.
The plot gives a concrete story for uncertainty: more data can improve precision, but the benefit depends on whether the additional data improves support in the right regions.
Segment-Level Uncertainty
Segment summaries are easier to communicate than thousands of row-level intervals. Here we summarize estimates and interval widths by combinations of need and support risk. This is often how model results become actionable for non-technical stakeholders.
The segment summary shows which groups look promising and which groups remain uncertain. This is usually more reportable than a list of individual predictions.
Plot Segment Uncertainty
This plot compares mean estimated CATE and interval bounds by segment. It keeps the segment view compact while still showing the uncertainty around each group.
The segment plot is a good final artifact for decision-makers. It shows both direction and precision, and it avoids pretending that all segments are equally well supported.
Reporting Checklist
The final table turns the notebook into a reusable checklist. A credible uncertainty report should state the estimand, interval method, overlap issues, decision rule, and limitations in plain language.
reporting_checklist = pd.DataFrame(
    [
        {
            "item": "Estimand",
            "what_to_report": "State whether the main target is ATE, CATE, or policy value.",
            "why_it_matters": "Different estimands have different uncertainty behavior.",
        },
        {
            "item": "Interval method",
            "what_to_report": "Name the estimator and interval construction used by the analysis.",
            "why_it_matters": "Readers need to know what uncertainty source is represented.",
        },
        {
            "item": "Overlap",
            "what_to_report": "Show treatment overlap and where interval width is largest.",
            "why_it_matters": "Weak support can make CATE estimates fragile.",
        },
        {
            "item": "CATE calibration",
            "what_to_report": "Compare estimated CATE groups and uncertainty summaries.",
            "why_it_matters": "Ranking quality matters more than a single global metric for targeting.",
        },
        {
            "item": "Decision rule",
            "what_to_report": "Explain whether treatment is assigned by point estimate, lower bound, or cost threshold.",
            "why_it_matters": "The rule determines how uncertainty affects action.",
        },
        {
            "item": "Policy uncertainty",
            "what_to_report": "Report uncertainty around average selected-group value where possible.",
            "why_it_matters": "Stakeholders usually care about the policy, not only row-level estimates.",
        },
        {
            "item": "Limits",
            "what_to_report": "State that intervals do not repair omitted confounding, leakage, or bad treatment definitions.",
            "why_it_matters": "Statistical precision is not the same as causal validity.",
        },
    ]
)
reporting_checklist.to_csv(TABLE_DIR / "09_uncertainty_reporting_checklist.csv", index=False)
display(reporting_checklist)
(Output: the seven-row reporting_checklist table defined in the cell above, one row per checklist item.)
The checklist is deliberately practical. It encourages a report that explains what can be trusted, what is uncertain, and what uncertainty does to the final decision.
Summary
This notebook showed how to move from treatment-effect estimates to uncertainty-aware causal decisions.
The main lessons are:
ATE intervals and CATE intervals answer different questions.
Row-level CATE intervals are often much wider than average-effect intervals.
Weak overlap and noisy segments often explain where intervals are widest.
Ranking by point estimate alone can select fragile high-uncertainty rows.
Lower-confidence-bound targeting is a simple way to trade volume for confidence.
Policy-value intervals help communicate the uncertainty around a decision rule.
Confidence intervals do not validate the causal assumptions by themselves; they quantify estimation uncertainty under the model and design.
The next tutorial extends the treatment setup beyond binary treatment and introduces multiple-treatment and continuous-treatment examples.