DoubleML Tutorial 18: End-To-End DoubleML Case Study
This notebook closes the DoubleML tutorial series with a full case study. The goal is not to introduce one more estimator in isolation. The goal is to show how a complete DoubleML analysis should read from start to finish: causal question, estimand, data roles, estimator choice, learners, diagnostics, uncertainty, sensitivity checks, subgroup exploration, and a short final report.
The case study asks a generic product analytics question:
What is the causal effect of receiving a guided onboarding nudge on next-period user value?
The treatment is binary: a user either receives the nudge or does not. The outcome is continuous: a next-period value index. Because treatment is observational rather than randomized, treated and untreated users differ in baseline intent, engagement, tenure, support needs, and other pre-treatment signals.
The target estimand is the average treatment effect,
\[
\theta_0 = \mathbb{E}\left[Y(1) - Y(0)\right],
\]
where \(Y(1)\) is the potential outcome if a user receives the nudge and \(Y(0)\) is the potential outcome if the same user does not. Identification requires the usual observational assumptions:
\[
\left(Y(1), Y(0)\right) \perp D \mid X,
\]
\[
0 < \mathbb{P}(D = 1 \mid X) < 1,
\]
and stable measurement of treatment, outcome, and controls. The first condition says the observed pre-treatment controls \(X\) are sufficient for adjustment. The second says treated and untreated users overlap at comparable covariate values.
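Under these two conditions, the ATE can be rewritten entirely in terms of observable quantities, which is the standard identification identity the estimation workflow relies on:
\[
\theta_0 = \mathbb{E}\big[\,\mathbb{E}[Y \mid D = 1, X] - \mathbb{E}[Y \mid D = 0, X]\,\big].
\]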
For a binary treatment, DoubleML’s interactive regression model (IRM) estimates two outcome nuisance functions and one propensity nuisance function:
\[
g_0(d, x) = \mathbb{E}[Y \mid D = d, X = x], \quad d \in \{0, 1\},
\]
\[
m_0(x) = \mathbb{P}(D = 1 \mid X = x).
\]
These feed the doubly robust (AIPW) score for the ATE:
\[
\psi(W; \theta, \eta) = g_0(1, X) - g_0(0, X) + \frac{D\,(Y - g_0(1, X))}{m_0(X)} - \frac{(1 - D)\,(Y - g_0(0, X))}{1 - m_0(X)} - \theta .
\]
The strength of this workflow is not that it makes observational data magically causal. Its strength is that it gives us a principled estimate after we have made the design assumptions explicit and checked the numerical risks.
Setup
This setup cell imports the full case-study stack and creates the shared tutorial output folders. The repository-root detection keeps paths correct whether the notebook is executed from the repo root or directly from the notebook directory.
```python
from pathlib import Path
import os
import warnings

PROJECT_ROOT = Path.cwd().resolve()
while PROJECT_ROOT != PROJECT_ROOT.parent and not (PROJECT_ROOT / "pyproject.toml").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent
if not (PROJECT_ROOT / "pyproject.toml").exists():
    raise FileNotFoundError("Could not locate pyproject.toml; run this notebook from inside the repository.")

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MPLCONFIG_DIR = OUTPUT_DIR / "matplotlib_cache"
for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MPLCONFIG_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ["MPLCONFIGDIR"] = str(MPLCONFIG_DIR)

warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="The estimated nu2 .*", category=UserWarning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import doubleml as dml
from doubleml.utils.propensity_score_processing import PSProcessorConfig
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
from scipy.special import expit
from sklearn.base import clone
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    HistGradientBoostingClassifier,
    HistGradientBoostingRegressor,
)
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sns.set_theme(style="whitegrid", context="talk")
RANDOM_SEED = 1818
NOTEBOOK_PREFIX = "18"
TRUE_TARGET_NAME = "true_ate"
print(f"DoubleML version: {dml.__version__}")
print(f"Writing artifacts under: {OUTPUT_DIR}")
```
The setup prints the DoubleML version and output root. Every generated file uses prefix 18, which makes the final artifact manifest easy to audit.
Helper Functions
The helper functions below are deliberately practical. A real analysis usually repeats the same actions many times: saving tables, fitting a model, extracting predictions, computing overlap diagnostics, building a doubly robust signal, and producing a compact estimate row.
These helpers do not hide the analysis. They standardize repeated mechanical work so the notebook can focus on design choices and diagnostics. The doubly_robust_signal() helper mirrors the ATE score described in the introduction.
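As one concrete example, a minimal version of the `doubly_robust_signal()` helper might look like the sketch below. The argument names (`g0_hat`, `g1_hat`, `m_hat`) are illustrative assumptions, not the notebook's exact signature; the formula itself is the standard AIPW score contribution.

```python
import numpy as np

def doubly_robust_signal(y, d, g0_hat, g1_hat, m_hat):
    """Per-row AIPW score contribution; its sample mean estimates the ATE.

    y      : observed outcome
    d      : binary treatment indicator (0/1)
    g0_hat : cross-fitted prediction of E[Y | D=0, X]
    g1_hat : cross-fitted prediction of E[Y | D=1, X]
    m_hat  : cross-fitted propensity P(D=1 | X)
    """
    return (
        g1_hat
        - g0_hat
        + d * (y - g1_hat) / m_hat
        - (1 - d) * (y - g0_hat) / (1 - m_hat)
    )
```

Averaging this signal over the sample recovers the orthogonal ATE estimate, and grouping it by a segment column gives the exploratory subgroup summaries used later.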
Case Study Roadmap
The roadmap table turns the rest of the notebook into a review checklist. End-to-end causal work should be readable in this order: design first, estimation second, diagnostics third, reporting last.
```python
roadmap = pd.DataFrame(
    [
        {"stage": "Causal question", "main_question": "What treatment, outcome, population, and estimand are we studying?"},
        {"stage": "Data design", "main_question": "Which variables are pre-treatment controls, treatment, outcome, and invalid post-treatment signals?"},
        {"stage": "Estimator choice", "main_question": "Why is IRM appropriate for a binary treatment?"},
        {"stage": "Main estimate", "main_question": "What does DoubleML estimate after cross-fitting nuisance functions?"},
        {"stage": "Diagnostics", "main_question": "Do propensity scores overlap, do nuisance models behave plausibly, and are results split-stable?"},
        {"stage": "Heterogeneity", "main_question": "Which user segments appear to benefit more, and how should that be reported cautiously?"},
        {"stage": "Sensitivity", "main_question": "How strong would hidden confounding need to be to threaten the conclusion?"},
        {"stage": "Report", "main_question": "What evidence, assumptions, and limitations should a reader see?"},
    ]
)
save_table(roadmap, f"{NOTEBOOK_PREFIX}_case_study_roadmap.csv")
display(roadmap)
```
| stage | main_question |
| --- | --- |
| Causal question | What treatment, outcome, population, and estimand are we studying? |
| Data design | Which variables are pre-treatment controls, treatment, outcome, and invalid post-treatment signals? |
| Estimator choice | Why is IRM appropriate for a binary treatment? |
| Main estimate | What does DoubleML estimate after cross-fitting nuisance functions? |
| Diagnostics | Do propensity scores overlap, do nuisance models behave plausibly, and are results split-stable? |
| Heterogeneity | Which user segments appear to benefit more, and how should that be reported cautiously? |
| Sensitivity | How strong would hidden confounding need to be to threaten the conclusion? |
| Report | What evidence, assumptions, and limitations should a reader see? |
The roadmap gives the notebook a story. The key habit is that the causal question and variable timing come before model fitting.
Simulate The Case Study Data
We simulate data with known truth so students can see when the workflow succeeds and where it remains vulnerable. The observed controls are all measured before treatment. The treatment is onboarding_nudge, and the outcome is next_period_value.
Treatment assignment is observational:
\[
D_i \sim \text{Bernoulli}(m_0(X_i)).
\]
The outcome is generated from a baseline value function plus a heterogeneous treatment effect, of the form
\[
Y_i = g_0(X_i) + \tau(X_i)\, D_i + \varepsilon_i .
\]
The first rows include the teaching-only columns true_propensity and true_treatment_effect. Those columns are not included in the DoubleML feature set. They are kept so we can compare the estimated ATE to the known truth.
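The full generator is not reproduced here, but its shape can be sketched. Everything below is an illustrative stand-in: the coefficients, noise scales, and the two controls shown are assumptions for teaching, not the notebook's actual simulator parameters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1818)
n = 5000

# Two of the pre-treatment controls, for brevity.
engagement_score = rng.normal(size=n)
intent_score = rng.normal(size=n)

# Observational assignment: the nudge probability depends on baseline signals.
logits = 0.8 * intent_score + 0.5 * engagement_score
true_propensity = 1.0 / (1.0 + np.exp(-logits))
onboarding_nudge = rng.binomial(1, true_propensity)

# Heterogeneous effect: high-intent users benefit more.
true_treatment_effect = 0.35 + 0.25 * (intent_score > 0)
baseline_value = 1.2 * engagement_score + 0.6 * intent_score
next_period_value = (
    baseline_value
    + true_treatment_effect * onboarding_nudge
    + rng.normal(scale=0.5, size=n)
)

sim = pd.DataFrame({
    "engagement_score": engagement_score,
    "intent_score": intent_score,
    "onboarding_nudge": onboarding_nudge,
    "next_period_value": next_period_value,
    "true_propensity": true_propensity,      # oracle column, excluded from features
    "true_treatment_effect": true_treatment_effect,  # oracle column, excluded from features
})
```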
Field Dictionary
This field dictionary documents the timing and role of every column. In an observational analysis, this table is not decorative. It is the difference between valid adjustment and accidental leakage.
```python
field_dictionary = pd.DataFrame(
    [
        {"field": "engagement_score", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline engagement before nudge eligibility."},
        {"field": "intent_score", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline intent or product-fit signal."},
        {"field": "content_breadth", "role": "pre-treatment control", "used_in_model": True, "description": "Breadth of prior usage across app areas."},
        {"field": "price_sensitivity", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline sensitivity or friction signal."},
        {"field": "mobile_share", "role": "pre-treatment control", "used_in_model": True, "description": "Share of prior activity on mobile."},
        {"field": "weekend_share", "role": "pre-treatment control", "used_in_model": True, "description": "Share of prior activity occurring on weekends."},
        {"field": "account_age_weeks", "role": "pre-treatment control", "used_in_model": True, "description": "Tenure before treatment eligibility."},
        {"field": "new_user", "role": "pre-treatment control", "used_in_model": True, "description": "Indicator derived from account age."},
        {"field": "email_opt_in", "role": "pre-treatment control", "used_in_model": True, "description": "Baseline communication eligibility."},
        {"field": "support_contacts", "role": "pre-treatment control", "used_in_model": True, "description": "Prior support-contact count."},
        {"field": "onboarding_nudge", "role": "treatment", "used_in_model": "treatment column", "description": "Binary exposure to the guided onboarding nudge."},
        {"field": "next_period_value", "role": "outcome", "used_in_model": "outcome column", "description": "Continuous value index after treatment."},
        {"field": "intent_segment", "role": "reporting segment", "used_in_model": False, "description": "Segment used for subgroup summaries."},
        {"field": "post_treatment_activity_proxy", "role": "post-treatment leakage", "used_in_model": False, "description": "Invalid control retained only as a warning example."},
        {"field": "true_propensity", "role": "oracle teaching column", "used_in_model": False, "description": "Known treatment probability from simulation."},
        {"field": "true_treatment_effect", "role": "oracle teaching column", "used_in_model": False, "description": "Known individual treatment effect from simulation."},
    ]
)
save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
display(field_dictionary)
```
| field | role | used_in_model | description |
| --- | --- | --- | --- |
| engagement_score | pre-treatment control | True | Baseline engagement before nudge eligibility. |
| intent_score | pre-treatment control | True | Baseline intent or product-fit signal. |
| content_breadth | pre-treatment control | True | Breadth of prior usage across app areas. |
| price_sensitivity | pre-treatment control | True | Baseline sensitivity or friction signal. |
| mobile_share | pre-treatment control | True | Share of prior activity on mobile. |
| weekend_share | pre-treatment control | True | Share of prior activity occurring on weekends. |
| account_age_weeks | pre-treatment control | True | Tenure before treatment eligibility. |
| new_user | pre-treatment control | True | Indicator derived from account age. |
| email_opt_in | pre-treatment control | True | Baseline communication eligibility. |
| support_contacts | pre-treatment control | True | Prior support-contact count. |
| onboarding_nudge | treatment | treatment column | Binary exposure to the guided onboarding nudge. |
| next_period_value | outcome | outcome column | Continuous value index after treatment. |
| intent_segment | reporting segment | False | Segment used for subgroup summaries. |
| post_treatment_activity_proxy | post-treatment leakage | False | Invalid control retained only as a warning example. |
| true_propensity | oracle teaching column | False | Known treatment probability from simulation. |
| true_treatment_effect | oracle teaching column | False | Known individual treatment effect from simulation. |
The feature set includes only pre-treatment controls. The post-treatment proxy is intentionally excluded. A real case study should include this kind of timing table before any model results.
Data Audit
The data audit checks sample size, treatment rate, outcome scale, known true ATE, and propensity overlap from the simulator. In a real observational dataset, the oracle columns would not be available, but the same audit structure still applies.
The treatment rate and propensity quantiles suggest there is meaningful treated and untreated support. This does not prove overlap after flexible estimation, so we will inspect estimated propensities later.
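A minimal audit helper in this spirit might look like the following sketch. The column names follow the field dictionary; the exact set of summary statistics is an illustrative assumption.

```python
import pandas as pd

def data_audit(df: pd.DataFrame, treatment: str, outcome: str) -> dict:
    """Compact audit row: sample size, treatment rate, outcome scale, missingness."""
    return {
        "n_rows": int(len(df)),
        "treatment_rate": float(df[treatment].mean()),
        "outcome_mean": float(df[outcome].mean()),
        "outcome_sd": float(df[outcome].std(ddof=1)),
        "n_missing_cells": int(df.isna().sum().sum()),
    }
```

Running this on the case-study frame, e.g. `data_audit(df, "onboarding_nudge", "next_period_value")`, gives the one-row summary a reviewer should see before any model output.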
Design Diagram
The diagram below focuses only on the causal design; the nuisance models used by DoubleML are introduced later in the estimation section. Keeping the graph causal makes the arrows easier to read: baseline controls affect both nudge assignment and future value, the nudge affects future value, baseline profile can modify the treatment effect, and dashed arrows mark unmeasured-confounding risk.
The dashed arrows represent residual hidden-confounding risk. DoubleML can estimate an orthogonal score for the observed design, but it cannot verify that all relevant confounders are measured.
Naive And Adjusted Baselines
Before fitting DoubleML, we compute two simple baselines. The naive difference in means ignores confounding. The adjusted linear regression controls for the same pre-treatment features but uses a simple linear outcome model. These are comparison points, not the final estimator.
The naive estimate is expected to be biased because treatment assignment depends on baseline intent and other features. The adjusted linear regression is a stronger baseline, but it still imposes simple functional-form assumptions.
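Both baselines can be computed in a few lines. The helper below is a sketch with assumed column names (`onboarding_nudge`, `next_period_value`), not the notebook's exact code; the adjusted estimate is simply the treatment coefficient from an OLS fit on treatment plus controls.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def naive_and_adjusted(df, controls, treatment="onboarding_nudge", outcome="next_period_value"):
    """Two comparison baselines: raw difference in means, and the treatment
    coefficient from a linear regression on treatment plus controls."""
    treated = df.loc[df[treatment] == 1, outcome]
    untreated = df.loc[df[treatment] == 0, outcome]
    naive = treated.mean() - untreated.mean()

    X = df[[treatment] + list(controls)].to_numpy()
    lr = LinearRegression().fit(X, df[outcome].to_numpy())
    adjusted = lr.coef_[0]  # coefficient on the treatment column
    return float(naive), float(adjusted)
```

On confounded data the two numbers diverge, which is exactly the point of showing them side by side before the orthogonal estimate.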
Fit The Main DoubleML IRM Model
The main model uses DoubleMLIRM because the treatment is binary. We use random forests for both outcome nuisance functions and the propensity model. Cross-fitting ensures that the score contribution for each row uses nuisance predictions from models that did not train on that row.
The main DoubleML estimate should be read with its confidence interval, not only the point estimate. Because this is a simulation, we can also compare it to the true ATE. In real data, that column would be unavailable.
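To make the cross-fitting mechanics concrete, here is a hand-rolled sketch of what `DoubleMLIRM` automates. This is illustrative only: the real model object handles fold bookkeeping, propensity processing, and inference internally, and the clipping value mirrors the 0.02 threshold mentioned in the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit_irm_ate(X, d, y, n_folds=5, seed=1818, clip=0.02):
    """Cross-fitted AIPW estimate of the ATE: each row's score uses
    nuisance predictions from models trained on the other folds."""
    n = len(y)
    g0_hat, g1_hat, m_hat = np.zeros(n), np.zeros(n), np.zeros(n)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Propensity model on the training folds.
        ml_m = RandomForestClassifier(n_estimators=100, random_state=seed)
        ml_m.fit(X[train_idx], d[train_idx])
        m_hat[test_idx] = ml_m.predict_proba(X[test_idx])[:, 1]
        # Separate outcome models for untreated and treated training rows.
        for level, g_hat in ((0, g0_hat), (1, g1_hat)):
            rows = train_idx[d[train_idx] == level]
            ml_g = RandomForestRegressor(n_estimators=100, random_state=seed)
            ml_g.fit(X[rows], y[rows])
            g_hat[test_idx] = ml_g.predict(X[test_idx])
    # Guard against extreme propensities before forming the score.
    m_hat = np.clip(m_hat, clip, 1 - clip)
    psi = (
        g1_hat - g0_hat
        + d * (y - g1_hat) / m_hat
        - (1 - d) * (y - g0_hat) / (1 - m_hat)
    )
    return float(psi.mean()), float(psi.std(ddof=1) / np.sqrt(n))
```

The mean of the score is the point estimate and its standard deviation over root-n gives the standard error, which is the same structure the library reports.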
Main Estimate Plot
The next plot compares the naive baseline, adjusted linear regression, and main DoubleML estimate against the known true ATE. This is a teaching plot: in real applications, the true vertical line would be replaced by a design-based benchmark or omitted entirely.
The visual comparison shows why end-to-end analysis is useful. A flexible orthogonal estimator should move the answer toward the known causal target when the adjustment set is correct and overlap is reasonable.
Nuisance Diagnostics
IRM relies on outcome predictions for treated and untreated potential outcomes and a propensity prediction for treatment assignment. The diagnostics below summarize the held-out nuisance performance produced by cross-fitting.
These diagnostics are numerical checks, not causal proof. Good nuisance performance supports the estimation workflow, but the identification assumptions still come from the study design.
Propensity Overlap Diagnostics
Overlap is central for binary-treatment causal inference. If estimated propensities are extremely close to zero or one, the doubly robust score can become unstable because inverse-propensity terms get large.
The estimated propensity distribution has no mass near the extreme boundaries after clipping. That makes the ATE estimate easier to trust numerically than a setting with many near-deterministic treatment assignments.
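The quantile and boundary checks described above can be packaged in a small helper. This is a sketch, not library code; the 0.02 boundary mirrors the clipping threshold used elsewhere in the notebook.

```python
import numpy as np

def overlap_summary(m_hat, d, eps=0.02):
    """Quick numeric overlap check on cross-fitted propensity scores."""
    m_hat = np.asarray(m_hat, dtype=float)
    d = np.asarray(d)
    return {
        "p05": float(np.quantile(m_hat, 0.05)),
        "p95": float(np.quantile(m_hat, 0.95)),
        "share_below_eps": float(np.mean(m_hat < eps)),
        "share_above_1m_eps": float(np.mean(m_hat > 1 - eps)),
        # If these two ranges barely overlap, the ATE leans on extrapolation.
        "treated_range": (float(m_hat[d == 1].min()), float(m_hat[d == 1].max())),
        "untreated_range": (float(m_hat[d == 0].min()), float(m_hat[d == 0].max())),
    }
```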
Plot Estimated Propensity Overlap
The histogram compares estimated propensity scores for treated and untreated users. We want enough overlap that both groups appear across the central support.
The treated and untreated distributions overlap in the middle of the propensity range. If one group appeared only near zero or one, the ATE would rely heavily on extrapolation.
Compare Learner Families
A complete case study should check whether the main result is a product of one learner choice. We compare three nuisance-model families: a linear/logistic baseline, random forests, and histogram gradient boosting.
The learner comparison should be read as robustness evidence. If all reasonable learner families tell a similar story, the result is less dependent on a single modeling choice. If they diverge, the report should slow down and explain why.
Plot Learner Family Estimates
The plot below compares point estimates and confidence intervals across nuisance learner choices. The true ATE line is available only because this is a teaching simulation.
The estimates are close enough to support a coherent main story. The learner table still belongs in the final report because it shows the result is not simply a random-forest artifact.
Doubly Robust Signal For Segment Diagnostics
The doubly robust signal is useful for exploratory subgroup summaries. Here we use it to ask whether high-intent users appear to benefit more from the nudge. This is a diagnostic use of the score signal, not a replacement for a formal policy-learning or GATE analysis.
The segment summary should move in the same direction as the known heterogeneous effect pattern. In real data, this would be a hypothesis-generating diagnostic unless accompanied by a formal heterogeneity design.
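A segment summary of the doubly robust signal reduces to a grouped mean with a standard error. The function below is an illustrative sketch with assumed inputs (a per-row signal and a segment label), not the notebook's exact helper.

```python
import numpy as np
import pandas as pd

def segment_dr_summary(dr_signal, segment):
    """Mean and standard error of the doubly robust signal per segment.
    Exploratory subgroup evidence, not a formal GATE estimate."""
    frame = pd.DataFrame({"segment": segment, "dr_signal": dr_signal})
    grouped = frame.groupby("segment")["dr_signal"]
    out = grouped.agg(["mean", "count"])
    out["std_error"] = grouped.std(ddof=1) / np.sqrt(out["count"])
    return out.reset_index()
```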
Plot Segment Effects
The segment plot compares doubly robust subgroup means with the oracle segment effects from the simulator. The oracle points are included only for teaching.
The high-intent segment shows a larger effect, matching the way the teaching data were generated. The important reporting habit is to label this as subgroup evidence, not as an automatically deployable targeting rule.
Sample-Split Stability
Cross-fitting uses random folds. A single split can occasionally be lucky or unlucky. We refit the main random-forest specification over several deterministic fold seeds to check whether the estimate is stable.
The repeated-split standard deviation is a practical stability check. If fold randomness changed the conclusion, the final report would need to say so and rely on repeated cross-fitting or a more stable design.
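The stability loop itself is simple enough to sketch. Here `fit_for_seed` is a placeholder for any function that refits the main specification with a given fold seed and returns the point estimate; the summary fields are illustrative.

```python
import numpy as np

def split_stability(fit_for_seed, seeds):
    """Refit the same specification across deterministic fold seeds and
    summarize how much the point estimate moves."""
    estimates = np.array([fit_for_seed(seed) for seed in seeds])
    return {
        "mean_estimate": float(estimates.mean()),
        "sd_estimate": float(estimates.std(ddof=1)),
        "min_estimate": float(estimates.min()),
        "max_estimate": float(estimates.max()),
    }
```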
Plot Sample-Split Stability
This plot shows the repeated estimates against the main estimate and true ATE. It is a compact way to communicate whether cross-fitting randomness matters materially.
The repeated estimates cluster around the same value. This supports the numerical stability of the main estimate under alternative fold assignments.
Sensitivity Analysis For Hidden Confounding
The final statistical check uses DoubleML’s sensitivity analysis. The parameters cf_y and cf_d describe the strength of an unobserved confounder in the outcome and treatment equations. The parameter rho controls the direction of confounding. These are not discovered from the data; they are stress-test settings.
```python
def extract_sensitivity_row(model, scenario, cf_y, cf_d, rho):
    params = model.sensitivity_params
    return {
        "scenario": scenario,
        "cf_y": cf_y,
        "cf_d": cf_d,
        "rho": rho,
        "theta_lower": float(params["theta"]["lower"][0]),
        "theta": float(model.coef[0]),
        "theta_upper": float(params["theta"]["upper"][0]),
        "ci_lower": float(params["ci"]["lower"][0]),
        "ci_upper": float(params["ci"]["upper"][0]),
        "rv_percent": float(params["rv"][0] * 100),
        "rva_percent": float(params["rva"][0] * 100),
    }

sensitivity_specs = [
    ("mild", 0.01, 0.01, 1.0),
    ("moderate", 0.03, 0.03, 1.0),
    ("strong", 0.06, 0.06, 1.0),
]
sensitivity_rows = []
for scenario, cf_y, cf_d, rho in sensitivity_specs:
    main_irm.sensitivity_analysis(cf_y=cf_y, cf_d=cf_d, rho=rho, level=0.95, null_hypothesis=0.0)
    sensitivity_rows.append(extract_sensitivity_row(main_irm, scenario, cf_y, cf_d, rho))
sensitivity_summary = pd.DataFrame(sensitivity_rows)
save_table(sensitivity_summary, f"{NOTEBOOK_PREFIX}_sensitivity_summary.csv")
display(sensitivity_summary)

# Restore the moderate scenario as the active sensitivity setting
# for anyone inspecting the model object later.
main_irm.sensitivity_analysis(cf_y=0.03, cf_d=0.03, rho=1.0, level=0.95, null_hypothesis=0.0)
```
| scenario | cf_y | cf_d | rho | theta_lower | theta | theta_upper | ci_lower | ci_upper | rv_percent | rva_percent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mild | 0.01 | 0.01 | 1.0 | 0.440392 | 0.461497 | 0.482602 | 0.377922 | 0.544962 | 19.694041 | 17.213465 |
| moderate | 0.03 | 0.03 | 1.0 | 0.397532 | 0.461497 | 0.525462 | 0.334932 | 0.587729 | 19.694041 | 17.213465 |
| strong | 0.06 | 0.06 | 1.0 | 0.331541 | 0.461497 | 0.591453 | 0.268705 | 0.653612 | 19.694041 | 17.213465 |
The sensitivity table asks how much the estimate could move under increasingly strong hidden-confounding scenarios. It does not prove that hidden confounding is absent. It makes the remaining assumption visible.
Plot Sensitivity Bounds
The sensitivity plot shows how the lower and upper treatment-effect bounds change as the hidden-confounding stress test gets stronger.
The sensitivity bounds communicate the main limitation of observational analysis: the result can be numerically precise and still depend on an untestable adjustment assumption.
Final Evidence Scorecard
The scorecard pulls together the pieces a reviewer should see before trusting the result. It includes design, overlap, nuisance quality, learner robustness, split stability, heterogeneity, and sensitivity.
```python
evidence_scorecard = pd.DataFrame(
    [
        {
            "evidence_area": "Design",
            "finding": "Only pre-treatment controls are used in the main feature set.",
            "status": "documented",
        },
        {
            "evidence_area": "Main estimate",
            "finding": f"Main IRM ATE is {main_row['estimate']:.3f} with 95% CI [{main_row['ci_95_lower']:.3f}, {main_row['ci_95_upper']:.3f}].",
            "status": "estimated",
        },
        {
            "evidence_area": "Overlap",
            "finding": f"Estimated propensity p05={main_row['propensity_p05']:.3f}, p95={main_row['propensity_p95']:.3f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Learner robustness",
            "finding": f"Learner-family estimates range from {learner_comparison['estimate'].min():.3f} to {learner_comparison['estimate'].max():.3f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Sample splitting",
            "finding": f"Repeated split SD is {split_stability_summary.loc[0, 'sd_estimate']:.4f}.",
            "status": "checked",
        },
        {
            "evidence_area": "Heterogeneity",
            "finding": "DR signal summaries suggest larger effects for higher-intent users.",
            "status": "exploratory",
        },
        {
            "evidence_area": "Sensitivity",
            "finding": "Sensitivity bounds are reported for mild, moderate, and strong hidden-confounding scenarios.",
            "status": "stress tested",
        },
        {
            "evidence_area": "Limitations",
            "finding": "Unmeasured confounding, measurement error, and deployment interference remain design risks.",
            "status": "must report",
        },
    ]
)
save_table(evidence_scorecard, f"{NOTEBOOK_PREFIX}_evidence_scorecard.csv")
display(evidence_scorecard)
```
| evidence_area | finding | status |
| --- | --- | --- |
| Design | Only pre-treatment controls are used in the main feature set. | documented |
| Main estimate | Main IRM ATE is 0.461 with 95% CI [0.387, 0.536]. | estimated |
| Overlap | Estimated propensity p05=0.209, p95=0.732. | checked |
| Learner robustness | Learner-family estimates range from 0.334 to 0... | checked |
| Sample splitting | Repeated split SD is 0.0066. | checked |
| Heterogeneity | DR signal summaries suggest larger effects for higher-intent users. | exploratory |
| Sensitivity | Sensitivity bounds are reported for mild, moderate, and strong hidden-confounding scenarios. | stress tested |
| Limitations | Unmeasured confounding, measurement error, and deployment interference remain design risks. | must report |
The scorecard is a compact review object. It is the bridge between a technical notebook and a reader who wants to know whether the analysis is credible.
Write The Final Case Study Report
The report template below is written as if this were the final deliverable. It includes the causal question, estimator, main estimate, diagnostics, sensitivity, subgroup findings, and limitations.
```python
report_text = rf"""# End-To-End DoubleML Case Study Report

## Causal Question
Estimate the average effect of receiving a guided onboarding nudge on next-period user value.

## Estimand
The target estimand is the average treatment effect:

$$\theta_0 = \mathbb{{E}}[Y(1) - Y(0)].$$

## Identification Assumptions
The analysis assumes conditional exchangeability given the documented pre-treatment controls, overlap between treated and untreated users, stable treatment definition, and no interference across users.

## Main DoubleML Specification
- Estimator: `DoubleMLIRM`
- Score: ATE
- Outcome learner: random forest regressor
- Propensity learner: random forest classifier
- Cross-fitting: 5 folds
- Propensity clipping threshold: 0.02

## Main Estimate
- Estimated ATE: {main_row['estimate']:.4f}
- Standard error: {main_row['std_error']:.4f}
- 95% confidence interval: [{main_row['ci_95_lower']:.4f}, {main_row['ci_95_upper']:.4f}]
- Known true ATE in this teaching simulation: {main_row['true_ate']:.4f}

## Diagnostics
- Estimated propensity p05 / p95: {main_row['propensity_p05']:.4f} / {main_row['propensity_p95']:.4f}
- Outcome RMSE under control model: {main_row['ml_g0_rmse']:.4f}
- Outcome RMSE under treated model: {main_row['ml_g1_rmse']:.4f}
- Propensity log loss: {main_row['ml_m_log_loss']:.4f}
- Sample-split estimate SD: {split_stability_summary.loc[0, 'sd_estimate']:.4f}

## Heterogeneity
The doubly robust signal suggests larger effects for higher-intent users. Treat this as subgroup evidence, not as a fully validated targeting policy.

## Sensitivity
Sensitivity bounds are saved for mild, moderate, and strong hidden-confounding scenarios. These stress tests do not remove hidden-confounding risk; they describe how the estimate would move under specified scenarios.

## Limitations
The analysis remains observational. It can be threatened by missing confounders, bad measurement, outcome leakage, violations of overlap, interference, or changes in the treatment definition.

## Artifact Paths
- Data: `{DATASET_DIR / f'{NOTEBOOK_PREFIX}_end_to_end_case_study_data.csv'}`
- Main estimate: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_main_irm_estimate.csv'}`
- Learner comparison: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_learner_family_comparison.csv'}`
- Segment summary: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_segment_dr_signal_summary.csv'}`
- Sensitivity summary: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_sensitivity_summary.csv'}`
- Evidence scorecard: `{TABLE_DIR / f'{NOTEBOOK_PREFIX}_evidence_scorecard.csv'}`
"""
report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_end_to_end_case_study_report.md"
report_path.write_text(report_text)
print(f"Wrote report to: {report_path}")
```
Wrote report to: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/18_end_to_end_case_study_report.md
The report is short on purpose. A final case-study report should not reproduce every notebook cell. It should tell a clear story and point to the artifacts that support that story.
Artifact Manifest
The final manifest lists every major output from this notebook. This makes the case study easy to review outside the notebook interface.
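A manifest builder can be as simple as globbing the output root for the notebook prefix. The sketch below is illustrative; it assumes the directory layout created in the setup cell (datasets, figures, tables, reports) and takes the root and prefix as arguments so it stays self-contained.

```python
from pathlib import Path
import pandas as pd

def build_manifest(output_dir: Path, prefix: str) -> pd.DataFrame:
    """List every artifact under the output root that carries the notebook
    prefix, grouped by its subfolder (datasets, figures, tables, reports)."""
    rows = [
        {"kind": path.parent.name, "artifact": path.name, "path": str(path)}
        for path in sorted(output_dir.rglob(f"{prefix}_*"))
        if path.is_file()
    ]
    return pd.DataFrame(rows)
```

Called as `build_manifest(OUTPUT_DIR, NOTEBOOK_PREFIX)`, this yields the reviewable table of every file the notebook wrote.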
The manifest is the table of contents for the case study outputs. It also makes the notebook easier to grade, review, or convert into a portfolio writeup.
Tutorial Wrap-Up
This case study pulls the DoubleML tutorial series together. The workflow is intentionally repeatable:
Define the causal question and target estimand.
Document variable timing and exclude invalid controls.
Choose the DoubleML class that matches the treatment and data structure.
Fit flexible nuisance models with cross-fitting.
Report uncertainty, overlap, nuisance diagnostics, learner robustness, and split stability.
Treat subgroup findings as exploratory unless the heterogeneity design is formal.
Use sensitivity analysis to describe, not erase, hidden-confounding risk.
End with a concise report and reviewable artifacts.
The main lesson is that DoubleML is a disciplined estimation framework. The quality of the answer still depends on the quality of the causal design.