DoubleML Tutorial 12: Inference, Bootstrap, And Confidence Bands
This notebook is about uncertainty. DoubleML returns not only a treatment-effect point estimate; it also returns standard errors, confidence intervals, bootstrap-based joint intervals, and adjusted p-values for multiple testing.
The basic inference logic is asymptotic. Under regularity conditions, the orthogonal score behaves like an average of approximately mean-zero influence terms. That gives an approximate normal distribution for the estimator:
\[
\sqrt{n}\,\frac{\hat{\theta} - \theta_0}{\hat{\sigma}} \;\xrightarrow{d}\; N(0, 1).
\]
From that approximation, DoubleML reports standard errors, t-statistics, p-values, and pointwise confidence intervals. But many applied analyses estimate more than one treatment effect. Once several effects are examined together, pointwise intervals can become too optimistic if they are read as a family-level statement. That is where multiplier bootstrap, joint intervals, and multiple-testing adjustment enter.
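The mechanics behind that summary row are simple. The sketch below computes a t-statistic, a two-sided normal p-value, and a pointwise 95% interval from a coefficient and standard error; the numbers are illustrative placeholders, not the notebook's actual estimates.

```python
import numpy as np
from scipy import stats

# Illustrative values; the notebook's real estimates come from the fitted model.
coef, se = 0.99, 0.04

t_stat = coef / se                        # test of H0: theta = 0
p_value = 2 * stats.norm.sf(abs(t_stat))  # two-sided normal p-value
z = stats.norm.ppf(0.975)                 # 95% marginal critical value, ~1.96
ci = (coef - z * se, coef + z * se)

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```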
This notebook uses a synthetic multi-treatment PLR design with three treatment variables:
d_main: a strong positive effect,
d_secondary: a smaller positive effect,
d_null: a true zero effect.
That mix lets us see how inference behaves for strong signals, moderate signals, and null effects. The key teaching goal is not to memorize one API call. The goal is to learn how to report uncertainty honestly:
Use standard errors and pointwise confidence intervals for individual effects.
Use bootstrap-based joint intervals when making simultaneous claims.
Use adjusted p-values when testing several effects at once.
Treat inference as conditional on the causal design and identifying assumptions, not as a substitute for them.
Expected runtime: usually under one minute. The notebook runs one main multi-treatment DoubleML fit, several bootstrap calls, and a small coverage simulation.
Setup
This cell prepares the output folders, imports DoubleML and scientific Python tools, and applies a few narrow warning filters for known notebook-environment noise. The code remains visible because this tutorial is meant to be studied and rerun.
The setup confirms the installed DoubleML version and where generated artifacts will be saved. Every table and figure in this notebook uses prefix 12.
Helper Functions
The helper functions below keep the tutorial code compact. They save artifacts, build a DoubleML backend, compute nuisance prediction metrics, and format confidence intervals.
The inference helpers intentionally use transparent normal-theory formulas. DoubleML already computes these quantities, but writing them explicitly makes the connection between standard errors, z critical values, and interval width easier to see.
These helpers are reusable. The key one for this notebook is ci_from_coef_se(), which mirrors the pointwise normal interval logic that appears in the DoubleML summary.
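The name `ci_from_coef_se()` comes from the notebook, but its body is not shown here; a minimal sketch of what such a helper might look like, under the pointwise normal-interval logic described above, is:

```python
import numpy as np
from scipy import stats

def ci_from_coef_se(coef, se, level=0.95):
    """Pointwise normal confidence interval: coef +/- z * se (sketch, assumed body)."""
    z = stats.norm.ppf(0.5 + level / 2)  # marginal critical value for the level
    coef = np.asarray(coef, dtype=float)
    se = np.asarray(se, dtype=float)
    return coef - z * se, coef + z * se

# Illustrative coefficients and standard errors for three treatments.
lower, upper = ci_from_coef_se([0.99, 0.46, -0.02], [0.040, 0.039, 0.039])
```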
Inference Vocabulary
The vocabulary table clarifies what the notebook means by standard error, confidence interval, joint interval, and adjusted p-value. These terms are closely related, but they answer different reporting questions.
```python
inference_vocabulary = pd.DataFrame(
    [
        {
            "term": "Standard error",
            "meaning": "Estimated sampling variability of the treatment-effect estimator.",
            "reporting question": "How noisy is this point estimate?",
        },
        {
            "term": "Pointwise confidence interval",
            "meaning": "Interval for one parameter using a marginal critical value.",
            "reporting question": "What range is plausible for this one effect?",
        },
        {
            "term": "Multiplier bootstrap",
            "meaning": "Resamples score contributions through random multipliers to approximate estimator uncertainty.",
            "reporting question": "How can we approximate the distribution of t-statistics without refitting nuisances many times?",
        },
        {
            "term": "Joint confidence interval",
            "meaning": "Simultaneous interval family designed to cover all selected parameters together.",
            "reporting question": "What ranges are plausible if I discuss all effects as a family?",
        },
        {
            "term": "Adjusted p-value",
            "meaning": "P-value corrected for multiple testing across several hypotheses.",
            "reporting question": "Which effects remain evidence-bearing after accounting for multiple looks?",
        },
    ]
)
save_table(inference_vocabulary, f"{NOTEBOOK_PREFIX}_inference_vocabulary.csv")
display(inference_vocabulary)
```
| term | meaning | reporting question |
| --- | --- | --- |
| Standard error | Estimated sampling variability of the treatment-effect estimator. | How noisy is this point estimate? |
| Pointwise confidence interval | Interval for one parameter using a marginal critical value. | What range is plausible for this one effect? |
| Multiplier bootstrap | Resamples score contributions through random multipliers to approximate estimator uncertainty. | How can we approximate the distribution of t-statistics without refitting nuisances many times? |
| Joint confidence interval | Simultaneous interval family designed to cover all selected parameters together. | What ranges are plausible if I discuss all effects as a family? |
| Adjusted p-value | P-value corrected for multiple testing across several hypotheses. | Which effects remain evidence-bearing after accounting for multiple looks? |
The distinction between pointwise and joint uncertainty is central. A pointwise 95% interval is not the same thing as a 95% statement about several intervals at once.
Synthetic Multi-Treatment PLR Data
We generate a linear nuisance design so the inference examples are clean. The outcome has three treatment variables: two have true nonzero effects and one has a true zero effect. The nuisance functions depend on observed controls, so adjustment is still necessary.
The data-generating process is intentionally simple enough for Ridge learners to perform well. This notebook is about inference mechanics, not learner complexity.
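A minimal sketch of this kind of design is shown below. The effect sizes, dimensions, and column names are assumptions for illustration, not the notebook's actual generator; the essential features match the text: linear nuisances, treatments that depend on the controls, and one true zero effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, p = 2000, 5
# Assumed effect sizes: strong, moderate, and exactly zero.
true_effects = {"d_main": 1.0, "d_secondary": 0.45, "d_null": 0.0}

X = rng.normal(size=(n, p))
# Each treatment loads on the controls, so naive regression would be confounded.
D = {name: X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)
     for name in true_effects}
# Outcome: linear nuisance in X plus the true treatment effects plus noise.
y = X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)
for name, theta in true_effects.items():
    y = y + theta * D[name]

df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)]).assign(**D, y=y)
```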
The first rows show the observed controls, oracle nuisance columns, treatments, and outcome. The oracle columns are present only for simulation documentation and are excluded from the DoubleML control matrix.
Data Dictionary And Audit
Inference should be reported after the design is clear. This audit records the treatment roles, true simulated effects, and basic confounding correlations.
The audit shows that treatments are related to their nuisance functions and, for some treatments, to outcome-relevant structure. That is why causal inference needs adjustment before uncertainty reporting.
Inference Workflow Diagram
This diagram shows the inference stack. The point estimate and standard error come from the fitted orthogonal score. Bootstrap then approximates the distribution of score-based t-statistics, which supports joint intervals and Romano-Wolf-style p-value adjustment.
The workflow separates marginal inference from simultaneous inference. Both are useful, but they answer different questions.
Fit A Multi-Treatment DoubleMLPLR Model
We fit one PLR model with three treatment columns. DoubleML estimates one coefficient per treatment and returns a summary table with standard errors, t-statistics, p-values, and pointwise intervals.
The learner is a scaled RidgeCV pipeline. Scaling is inside the pipeline so preprocessing is done inside the cross-fitting folds rather than globally leaking information.
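A scaled RidgeCV pipeline of the kind described can be built with scikit-learn as below; the alpha grid is an assumption, and the smoke-test data are illustrative only (the notebook fits this learner through DoubleML, not directly).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# Scaling lives inside the pipeline, so each cross-fitting fold standardizes
# with its own training statistics instead of leaking global statistics.
learner = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))

# Smoke test on toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(scale=0.1, size=200)
learner.fit(X, y)
print(learner.score(X, y))  # in-sample R^2, high for this easy linear problem
```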
The summary table is the default inference view. It is pointwise: each row is interpreted as a separate treatment-effect estimate unless we add joint inference or p-value adjustment.
Standard Error And Pointwise Inference Table
This cell converts the DoubleML summary into an explicit reporting table and adds the known true effect from the simulation. In real data, the true effect column would not exist; it is included here only for learning.
The strong and moderate effects are clearly separated from zero, while the null effect is not. The nuisance-quality table records that inference was built from cross-fitted nuisance predictions rather than in-sample fits.
Confidence Levels And Interval Widths
Confidence intervals widen as the confidence level increases. This cell computes 90%, 95%, and 99% pointwise intervals manually from the coefficient and standard error.
The table shows the mechanical relationship between confidence level and interval width. Higher confidence requires a larger critical value, so the interval becomes wider.
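The mechanical relationship can be sketched in a few lines; the coefficient and standard error below are illustrative placeholders, not the notebook's estimates.

```python
from scipy import stats

coef, se = 0.99, 0.04  # illustrative estimate and standard error

widths = {}
for level in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(0.5 + level / 2)  # larger level -> larger critical value
    widths[level] = 2 * z * se
    print(f"{level:.0%}: z = {z:.3f}, interval width = {widths[level]:.3f}")
```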
Confidence Level Plot
The plot below shows the same idea visually. Each treatment receives three intervals, one per confidence level.
The x markers are the known true effects from the simulation. In applied work those markers do not exist, so the interval itself must be interpreted as an uncertainty statement under the design assumptions.
Multiplier Bootstrap
DoubleML’s bootstrap() method draws random multipliers for the estimated score contributions. This approximates the distribution of t-statistics without refitting all nuisance models for every bootstrap draw.
We use 800 bootstrap replications here. That is enough for a stable teaching plot while remaining fast.
The bootstrap stores t-statistics for each treatment and each repeated sample split. The maximum absolute t-statistic distribution is what makes simultaneous inference wider than pointwise inference.
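The idea can be illustrated end to end without DoubleML. The sketch below is not DoubleML's internal implementation: it uses synthetic standardized score contributions and normal multipliers to show why the max-|t| critical value exceeds the pointwise one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, n_boot = 2000, 3, 800  # observations, treatments, bootstrap replications

# Stand-in for per-observation score contributions, standardized per column.
psi = rng.normal(size=(n, k))
psi = (psi - psi.mean(axis=0)) / psi.std(axis=0)

# One normal multiplier per observation, shared across all treatments per draw.
xi = rng.normal(size=(n_boot, n))
t_boot = (xi @ psi) / np.sqrt(n)  # (n_boot, k) bootstrap t-statistics

# The 95% quantile of the max absolute t-statistic drives joint intervals.
max_abs_t = np.abs(t_boot).max(axis=1)
joint_crit = np.quantile(max_abs_t, 0.95)
print(joint_crit)  # exceeds the pointwise 1.96, so joint intervals are wider
```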
Bootstrap T-Statistic Distribution
This figure shows the bootstrap t-statistic distribution by treatment, plus the distribution of the maximum absolute t-statistic used for joint intervals.
The max-statistic distribution has a larger critical value than the usual 1.96 pointwise normal cutoff. That is why joint intervals are wider.
Pointwise Versus Joint Confidence Intervals
Now we compare pointwise 95% confidence intervals with bootstrap-based joint 95% confidence intervals. Joint intervals are appropriate when reporting several effects as a family and wanting simultaneous coverage.
The joint intervals are wider because they are calibrated to cover the collection of effects together. This is the right trade-off when the write-up makes simultaneous claims.
Pointwise Versus Joint Interval Plot
The plot below overlays the two interval types. Pointwise intervals are blue; joint intervals are orange. The black x marks the known true effect.
The null treatment interval includes zero under both approaches. The nonzero treatments remain clearly positive, even after joint calibration.
Bootstrap Method Comparison
DoubleML supports several multiplier bootstrap methods: normal, Bayes, and wild. In many well-behaved examples they give similar results, but comparing them is a useful robustness check.
```python
bootstrap_method_rows = []
for method in ["normal", "Bayes", "wild"]:
    np.random.seed(RANDOM_SEED)
    inference_model.bootstrap(method=method, n_rep_boot=500)
    method_joint_ci = (
        inference_model.confint(joint=True, level=0.95)
        .reset_index()
        .rename(columns={"index": "treatment", "2.5 %": "joint_lower", "97.5 %": "joint_upper"})
    )
    method_joint_ci["method"] = method
    method_joint_ci["joint_width"] = method_joint_ci["joint_upper"] - method_joint_ci["joint_lower"]
    bootstrap_method_rows.append(method_joint_ci)

bootstrap_method_comparison = pd.concat(bootstrap_method_rows, ignore_index=True)
save_table(bootstrap_method_comparison, f"{NOTEBOOK_PREFIX}_bootstrap_method_comparison.csv")
display(bootstrap_method_comparison)

# Restore the normal bootstrap for the p-value adjustment sections below.
np.random.seed(RANDOM_SEED)
_ = inference_model.bootstrap(method="normal", n_rep_boot=800)  # assign to suppress the repr output
```
| treatment | joint_lower | joint_upper | method | joint_width |
| --- | --- | --- | --- | --- |
| d_main | 0.913914 | 1.071688 | normal | 0.157774 |
| d_secondary | 0.385135 | 0.537902 | normal | 0.152767 |
| d_null | -0.093496 | 0.060591 | normal | 0.154087 |
| d_main | 0.913589 | 1.072814 | Bayes | 0.159226 |
| d_secondary | 0.383071 | 0.538794 | Bayes | 0.155723 |
| d_null | -0.097451 | 0.064545 | Bayes | 0.161996 |
| d_main | 0.914831 | 1.071966 | wild | 0.157136 |
| d_secondary | 0.383897 | 0.538196 | wild | 0.154299 |
| d_null | -0.096607 | 0.063702 | wild | 0.160309 |
The methods produce similar joint intervals in this clean synthetic design. Bigger differences would be a signal to inspect sample size, score behavior, and model stability.
Multiple Testing Adjustment
When testing several effects, unadjusted p-values answer each test separately. Adjusted p-values account for the fact that several hypotheses are being examined. DoubleML supports Romano-Wolf adjustment through the bootstrap, and also methods such as Holm and Bonferroni through statsmodels.
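Holm and Bonferroni are simple enough to write out directly, which makes the logic visible. The sketch below is a self-contained implementation, not the statsmodels or DoubleML API; the raw p-values are illustrative.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (self-contained sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), enforce monotonicity.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

raw = np.array([1e-6, 0.003, 0.42])    # illustrative unadjusted p-values
print(holm_adjust(raw))                 # Holm: [3e-06, 0.006, 0.42]
print(np.minimum(raw * raw.size, 1.0))  # Bonferroni: simpler, more conservative
```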
The two true nonzero effects remain significant after adjustment, while the null treatment remains non-significant. That is the behavior we hope to see in this teaching simulation.
Adjusted P-Value Plot
The plot below shows pointwise and adjusted p-values on a log scale. The dashed line marks 0.05.
The log scale makes tiny p-values visible while still showing the null treatment near the non-significant region. Adjustment matters most when there are several borderline effects.
Small Coverage Simulation
This short simulation repeats a simple single-treatment PLR design several times and checks whether the pointwise 95% confidence interval covers the known true effect. It is intentionally small and should be read as a teaching sketch, not a formal Monte Carlo study.
A real simulation study would use many more repetitions, multiple sample sizes, and a fixed analysis plan.
The empirical coverage rate will not be exactly 95% with only 30 simulations. The point is to connect the interval formula to repeated-sampling behavior: over many hypothetical datasets, valid 95% intervals should cover the true effect about 95% of the time.
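The repeated-sampling idea can be sketched with an even simpler model than the notebook's PLR design. The block below runs a no-intercept OLS coverage check under assumed settings (true effect, sample size, number of simulations are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_theta, n, n_sims = 1.0, 500, 200  # assumed teaching-sketch settings
z = stats.norm.ppf(0.975)

covered = 0
for _ in range(n_sims):
    d = rng.normal(size=n)
    y = true_theta * d + rng.normal(size=n)
    theta_hat = (d @ y) / (d @ d)  # no-intercept OLS slope
    resid = y - theta_hat * d
    se = np.sqrt(resid @ resid / (n - 1)) / np.sqrt(d @ d)
    if theta_hat - z * se <= true_theta <= theta_hat + z * se:
        covered += 1

print(covered / n_sims)  # should land near 0.95 over many repetitions
```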
Coverage Simulation Plot
Each horizontal line is one simulated 95% confidence interval. Blue intervals cover the true effect; red intervals miss it.
Coverage is a repeated-sampling idea. A single applied interval either contains the true effect or it does not, but we never observe that directly. The simulation is here to make the concept concrete.
Reporting Checklist
This checklist summarizes what should be reported when presenting DoubleML inference results. The emphasis is on clarity: point estimate, uncertainty type, bootstrap settings, and multiple-testing choices should all be visible.
```python
inference_reporting_checklist = pd.DataFrame(
    [
        {"item": "State the estimand and treatment columns",
         "why": "Inference is only meaningful after the target is clear."},
        {"item": "Report coefficient, standard error, and confidence interval",
         "why": "A point estimate without uncertainty is incomplete."},
        {"item": "Say whether intervals are pointwise or joint",
         "why": "These intervals answer different reporting questions."},
        {"item": "Document bootstrap method and number of replications",
         "why": "Joint intervals and Romano-Wolf adjustment depend on bootstrap settings."},
        {"item": "Use adjusted p-values for multiple treatment tests",
         "why": "Several tests increase the chance of false discoveries."},
        {"item": "Report nuisance model and cross-fitting choices",
         "why": "Uncertainty is conditional on the fitted DoubleML design."},
        {"item": "Separate statistical uncertainty from identification uncertainty",
         "why": "Bootstrap cannot fix omitted variables, bad timing, or invalid instruments."},
    ]
)
save_table(inference_reporting_checklist, f"{NOTEBOOK_PREFIX}_inference_reporting_checklist.csv")
display(inference_reporting_checklist)
```
| item | why |
| --- | --- |
| State the estimand and treatment columns | Inference is only meaningful after the target is clear. |
| Report coefficient, standard error, and confidence interval | A point estimate without uncertainty is incomplete. |
| Say whether intervals are pointwise or joint | These intervals answer different reporting questions. |
| Document bootstrap method and number of replications | Joint intervals and Romano-Wolf adjustment depend on bootstrap settings. |
| Use adjusted p-values for multiple treatment tests | Several tests increase the chance of false discoveries. |
| Report nuisance model and cross-fitting choices | Uncertainty is conditional on the fitted DoubleML design. |
| Separate statistical uncertainty from identification uncertainty | Bootstrap cannot fix omitted variables, bad timing, or invalid instruments. |
The last row is the most important. Statistical inference quantifies sampling uncertainty under the design; it does not prove the design is correct.
Report Template And Artifact Manifest
The final cell writes a report template and an artifact manifest. This keeps the tutorial reproducible and gives a structure for future applied analyses.
```python
report_text = f"""# DoubleML Inference Report Template

## Estimand And Model
- Outcome:
- Treatment column(s):
- Control set:
- DoubleML model class:
- Nuisance learners:
- Cross-fitting design:

## Point Estimates
- Coefficient table:
- Standard errors:
- Pointwise confidence intervals:

## Bootstrap And Simultaneous Inference
- Bootstrap method:
- Bootstrap replications:
- Joint confidence intervals:
- Adjusted p-value method:

## Main Findings
- Effects clearly separated from zero:
- Effects sensitive to adjustment:
- Null or inconclusive effects:

## Caveats
- Identification assumptions:
- Sample size and split stability:
- Multiple testing:
- Remaining uncertainty not captured by standard errors:
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_inference_report_template.md"
report_path.write_text(report_text)

artifact_manifest = pd.DataFrame(
    [
        {"artifact": "synthetic multi-treatment PLR data",
         "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_multitreatment_plr_data.csv")},
        {"artifact": "standard inference summary",
         "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_standard_inference_summary.csv")},
        {"artifact": "pointwise vs joint intervals",
         "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pointwise_vs_joint_ci.csv")},
        {"artifact": "adjusted p-value comparison",
         "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_adjusted_pvalue_comparison.csv")},
        {"artifact": "coverage simulation",
         "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_coverage_simulation.csv")},
        {"artifact": "report template", "path": str(report_path)},
        {"artifact": "inference workflow figure",
         "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_inference_workflow.png")},
        {"artifact": "pointwise vs joint interval figure",
         "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_pointwise_vs_joint_ci.png")},
    ]
)
save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(Markdown(f"Report template written to `{report_path}`"))
display(artifact_manifest)
```
Report template written to /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/12_inference_report_template.md
| artifact | path |
| --- | --- |
| synthetic multi-treatment PLR data | /home/apex/Documents/ranking_sys/notebooks/tut... |
| standard inference summary | /home/apex/Documents/ranking_sys/notebooks/tut... |
| pointwise vs joint intervals | /home/apex/Documents/ranking_sys/notebooks/tut... |
| adjusted p-value comparison | /home/apex/Documents/ranking_sys/notebooks/tut... |
| coverage simulation | /home/apex/Documents/ranking_sys/notebooks/tut... |
| report template | /home/apex/Documents/ranking_sys/notebooks/tut... |
| inference workflow figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
| pointwise vs joint interval figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
The notebook now covers the core DoubleML inference workflow: standard errors, pointwise intervals, multiplier bootstrap, joint intervals, adjusted p-values, and reporting discipline.
What Comes Next
The next natural topic is sensitivity analysis for unobserved confounding: how DoubleML represents robustness to omitted confounders and how to explain those sensitivity parameters in a way that is useful rather than mystical.