DoubleML Tutorial 17: Common Pitfalls, Diagnostics, And Reporting
DoubleML is powerful because it separates a causal target from nuisance prediction. That separation is also where many applied analyses go wrong. A DoubleML estimate is not automatically credible just because the code runs, the learners are flexible, or the standard error is small. The estimate is credible only when the identification story, score construction, sample splitting, nuisance quality, and reporting all agree with each other.
The core partially linear regression design used throughout this notebook is:

\[ Y = D\theta_0 + g_0(X) + \zeta, \qquad \mathbb{E}[\zeta \mid D, X] = 0, \]
\[ D = m_0(X) + V, \qquad \mathbb{E}[V \mid X] = 0, \]

with the causal target \(\theta_0\). DoubleML estimates \(g_0\) and \(m_0\) with machine learning, residualizes both outcome and treatment, and solves an orthogonal score. For the partialling-out score, define \(\ell_0(X) = \mathbb{E}[Y \mid X]\) and

\[ \psi(W; \theta, \eta) = \big(Y - \ell(X) - \theta\,(D - m(X))\big)\,\big(D - m(X)\big), \]

so that solving \(\mathbb{E}[\psi] = 0\) amounts to regressing the residualized outcome on the residualized treatment.
The important practical lesson is that orthogonality protects the final estimate from small nuisance errors, not from a broken design. If a confounder is missing, if a post-treatment variable is included as a control, if treatment has almost no residual variation, or if sample splitting is unstable, DoubleML will still produce a number. This notebook teaches how to detect and report those risks.
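To make the partialling-out mechanics concrete, here is a minimal, self-contained numpy sketch: residualize both outcome and treatment on the confounder, then regress residual on residual. The linear DGP and its coefficients are illustrative assumptions, not the notebook's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                        # observed confounder
d = 0.8 * x + rng.normal(size=n)              # treatment depends on x
y = 1.0 * d + 2.0 * x + rng.normal(size=n)    # true theta_0 = 1.0

# Residualize outcome and treatment on x (simple OLS slopes).
y_res = y - (np.cov(y, x)[0, 1] / np.var(x)) * x
d_res = d - (np.cov(d, x)[0, 1] / np.var(x)) * x

# The partialling-out estimate is a residual-on-residual regression.
theta_hat = float(np.sum(y_res * d_res) / np.sum(d_res**2))
print(round(theta_hat, 3))
```

The same residual-on-residual step is what the orthogonal score solves; DoubleML's contribution is doing it with flexible learners and cross-fitting instead of known linear projections.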
The notebook covers six applied failure modes:
Omitted confounding that cannot be fixed by flexible learners.
Bad controls and leakage, especially post-treatment variables.
Weak residual treatment variation, the continuous-treatment version of poor overlap.
Nuisance-model diagnostics that should be read as numerical checks, not causal proof.
Sample-split sensitivity and repeated fitting.
Reporting standards that make a DoubleML analysis auditable.
Setup
This setup cell imports the libraries, creates tutorial output folders, and sets a writable Matplotlib cache directory before importing plotting libraries. That small environment step keeps notebook execution clean on machines where the default Matplotlib config directory is not writable.
from pathlib import Path
import os
import warnings

# Locate the repository root even when nbconvert executes from this notebook's directory.
PROJECT_ROOT = Path.cwd().resolve()
while PROJECT_ROOT != PROJECT_ROOT.parent and not (PROJECT_ROOT / "pyproject.toml").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent
if not (PROJECT_ROOT / "pyproject.toml").exists():
    raise FileNotFoundError("Could not locate pyproject.toml; run this notebook from inside the repository.")

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MPLCONFIG_DIR = OUTPUT_DIR / "matplotlib_cache"
for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MPLCONFIG_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ["MPLCONFIGDIR"] = str(MPLCONFIG_DIR)

# These filters are set before importing DoubleML because some optional notebook-widget
# warnings can be emitted during import in lightweight local environments.
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", category=FutureWarning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import doubleml as dml
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

sns.set_theme(style="whitegrid", context="talk")
RANDOM_SEED = 1717
NOTEBOOK_PREFIX = "17"
TRUE_THETA = 1.0

print(f"DoubleML version: {dml.__version__}")
print(f"Writing artifacts under: {OUTPUT_DIR}")
The setup confirms that the notebook can import DoubleML and write artifacts into the tutorial output folders. All files created here use prefix 17 so they are easy to connect back to this notebook.
Helper Functions
The helper functions keep the rest of the notebook readable. The most important helper is fit_plr(), which fits a DoubleMLPLR model and returns both the fitted object and a compact diagnostic row. The diagnostics intentionally combine causal and numerical information: coefficient error, confidence interval, nuisance RMSE, residual treatment variation, and the average score denominator.
def save_table(df, filename):
    """Save a table into the DoubleML tutorial table folder."""
    path = TABLE_DIR / filename
    df.to_csv(path, index=False)
    return path

def save_dataset(df, filename):
    """Save a dataset into the DoubleML tutorial dataset folder."""
    path = DATASET_DIR / filename
    df.to_csv(path, index=False)
    return path

def rmse(y_true, y_pred):
    """Compute root mean squared error with explicit float output."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def extract_prediction(model, learner_name):
    """DoubleML stores predictions as n_obs x n_rep x n_treat arrays for this PLR setup."""
    return model.predictions[learner_name][:, 0, 0]

def fit_plr(df, x_cols, learner_l, learner_m, label, true_theta=TRUE_THETA, sample_splits=None):
    """Fit a PLR model and return the fitted model plus a one-row diagnostic dictionary."""
    data = dml.DoubleMLData(df, y_col="outcome", d_cols="treatment", x_cols=x_cols)
    model = dml.DoubleMLPLR(
        data,
        ml_l=clone(learner_l),
        ml_m=clone(learner_m),
        n_folds=5,
        score="partialling out",
    )
    if sample_splits is not None:
        model.set_sample_splitting(sample_splits)
    model.fit(store_predictions=True)
    y_hat = extract_prediction(model, "ml_l")
    d_hat = extract_prediction(model, "ml_m")
    residualized_treatment = df["treatment"].to_numpy() - d_hat
    psi_a = model.psi_elements["psi_a"].reshape(-1)
    coef = float(model.coef[0])
    se = float(model.se[0])
    row = {
        "model": label,
        "n_controls": len(x_cols),
        "estimate": coef,
        "std_error": se,
        "ci_95_lower": coef - 1.96 * se,
        "ci_95_upper": coef + 1.96 * se,
        "true_theta": true_theta,
        "absolute_error": abs(coef - true_theta),
        "outcome_rmse": rmse(df["outcome"], y_hat),
        "treatment_rmse": rmse(df["treatment"], d_hat),
        "resid_treatment_sd": float(np.std(residualized_treatment)),
        "mean_resid_treatment_sq": float(np.mean(residualized_treatment**2)),
        "score_denominator": float(-np.mean(psi_a)),
    }
    return model, row

def make_kfold_splits(n_obs, seed, n_splits=5):
    """Create a DoubleML-compatible one-repetition list of train/test folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [(train_idx, test_idx) for train_idx, test_idx in kfold.split(np.arange(n_obs))]
The helper output will let us compare scenarios without hiding the mechanics. When a later table shows a bad estimate, the same row also shows whether nuisance prediction, residual treatment variation, or the control set changed.
Pitfall Map
Before running code, it helps to name the failure modes. DoubleML is an estimator, not an identification oracle. The table below separates problems that come from the causal design from problems that come from implementation or reporting.
pitfall_map = pd.DataFrame(
    [
        {
            "pitfall": "Omitted confounding",
            "where_it_enters": "Identification",
            "symptom": "Estimate changes sharply when an important pre-treatment confounder is added or removed.",
            "recommended_check": "State the adjustment set and run negative-control or sensitivity checks when possible.",
        },
        {
            "pitfall": "Bad controls or leakage",
            "where_it_enters": "Feature design",
            "symptom": "Including post-treatment variables can shrink, reverse, or otherwise distort the causal estimate.",
            "recommended_check": "Classify every feature by time: pre-treatment, treatment, mediator, outcome, or future signal.",
        },
        {
            "pitfall": "Weak residual treatment variation",
            "where_it_enters": "Overlap or positivity",
            "symptom": "The residualized treatment has very small variance and standard errors inflate.",
            "recommended_check": "Inspect residualized-treatment distributions and the score denominator.",
        },
        {
            "pitfall": "Poor nuisance models",
            "where_it_enters": "Estimation",
            "symptom": "Nuisance RMSE is high or unstable across learners and folds.",
            "recommended_check": "Compare learners, tune deliberately, and report nuisance diagnostics.",
        },
        {
            "pitfall": "Sample-split luck",
            "where_it_enters": "Cross-fitting",
            "symptom": "Estimates move materially across random fold assignments.",
            "recommended_check": "Use repeated cross-fitting or rerun over several deterministic seeds.",
        },
        {
            "pitfall": "Under-reporting",
            "where_it_enters": "Communication",
            "symptom": "Readers cannot tell what data, learners, score, folds, or assumptions produced the estimate.",
            "recommended_check": "Publish a compact design table, diagnostics table, and limitation statement.",
        },
    ]
)
save_table(pitfall_map, f"{NOTEBOOK_PREFIX}_pitfall_map.csv")
display(pitfall_map)
| pitfall | where_it_enters | symptom | recommended_check |
|---|---|---|---|
| Omitted confounding | Identification | Estimate changes sharply when an important pre-treatment confounder is added or removed. | State the adjustment set and run negative-control or sensitivity checks when possible. |
| Bad controls or leakage | Feature design | Including post-treatment variables can shrink, reverse, or otherwise distort the causal estimate. | Classify every feature by time: pre-treatment, treatment, mediator, outcome, or future signal. |
| Weak residual treatment variation | Overlap or positivity | The residualized treatment has very small variance and standard errors inflate. | Inspect residualized-treatment distributions and the score denominator. |
| Poor nuisance models | Estimation | Nuisance RMSE is high or unstable across learners and folds. | Compare learners, tune deliberately, and report nuisance diagnostics. |
| Sample-split luck | Cross-fitting | Estimates move materially across random fold assignments. | Use repeated cross-fitting or rerun over several deterministic seeds. |
| Under-reporting | Communication | Readers cannot tell what data, learners, score, folds, or assumptions produced the estimate. | Publish a compact design table, diagnostics table, and limitation statement. |
The map is the checklist for the rest of the notebook. The examples below do not try to exhaust every possible failure. They show the kinds of evidence a careful analyst should collect before trusting a DoubleML result.
Simulate A Known-Truth Confounding Design
The first simulation makes confounding visible. The column latent_need acts like a strong pre-treatment confounder. We include it in the dataset so we can demonstrate what happens when it is included, omitted, or replaced with a bad post-treatment feature. In a real observational dataset, the dangerous version is worse: the confounder may be absent entirely.
The data-generating process is linear so that a well-specified baseline has a fair chance to recover the true value. Schematically, with coefficient vectors \(\alpha, \beta\) and scalars \(\delta, \gamma\) (the exact values live in the simulation code):

\[ D = \alpha^\top X + \delta \cdot \text{latent\_need} + V, \qquad Y = \theta_0 D + \beta^\top X + \gamma \cdot \text{latent\_need} + \varepsilon, \qquad \theta_0 = 1. \]
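The original simulation cell is not reproduced in this export, so the sketch below regenerates a dataset with the same column layout. Every coefficient here is an illustrative assumption, not the notebook's exact value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1717)
n = 4000
TRUE_THETA = 1.0

controls = rng.normal(size=(n, 8))                      # x00 .. x07, pre-treatment
latent_need = 0.6 * controls[:, 0] + rng.normal(size=n)  # strong pre-treatment confounder
treatment = controls @ rng.normal(0.3, 0.1, size=8) + 1.2 * latent_need + rng.normal(size=n)
outcome = (TRUE_THETA * treatment + controls @ rng.normal(0.5, 0.1, size=8)
           + 1.5 * latent_need + rng.normal(size=n))
# An intentionally invalid control: partly built from treatment and outcome.
post_outcome_proxy = 0.5 * outcome + 0.3 * treatment + rng.normal(scale=0.5, size=n)

df = pd.DataFrame(controls, columns=[f"x{i:02d}" for i in range(8)])
df["latent_need"] = latent_need
df["post_outcome_proxy"] = post_outcome_proxy
df["treatment"] = treatment
df["outcome"] = outcome
print(df.shape)
```

The point of building `post_outcome_proxy` from outcome and treatment is that no learner can tell it apart from a legitimate covariate; only the timing metadata can.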
The first rows show ordinary pre-treatment controls, the teaching confounder latent_need, the intentionally invalid post_outcome_proxy, the treatment, and the outcome. The next step is to document these fields so the control-set choices are explicit instead of hidden inside model code.
Field Dictionary
Feature timing is one of the most important reporting details in causal work. A variable can be predictive and still be invalid as a control. The table below classifies each group of columns by causal role.
field_dictionary = pd.DataFrame(
    [
        {
            "field_group": "x00 to x07",
            "role": "Observed pre-treatment controls",
            "safe_for_adjustment": "Yes",
            "notes": "Baseline covariates measured before treatment assignment.",
        },
        {
            "field_group": "latent_need",
            "role": "Strong pre-treatment confounder used for teaching",
            "safe_for_adjustment": "Yes, if it is genuinely observed before treatment",
            "notes": "Omitting it creates a controlled demonstration of unmeasured-confounding bias.",
        },
        {
            "field_group": "post_outcome_proxy",
            "role": "Post-treatment leakage variable",
            "safe_for_adjustment": "No",
            "notes": "It is partly constructed from outcome and treatment, so adjusting for it changes the causal question.",
        },
        {
            "field_group": "treatment",
            "role": "Continuous treatment D",
            "safe_for_adjustment": "Target treatment, not a control",
            "notes": "The effect of this variable on outcome is the target of estimation.",
        },
        {
            "field_group": "outcome",
            "role": "Outcome Y",
            "safe_for_adjustment": "No",
            "notes": "This is the response variable, not a feature for the nuisance models.",
        },
    ]
)
save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
display(field_dictionary)
| field_group | role | safe_for_adjustment | notes |
|---|---|---|---|
| x00 to x07 | Observed pre-treatment controls | Yes | Baseline covariates measured before treatment assignment. |
| latent_need | Strong pre-treatment confounder used for teaching | Yes, if it is genuinely observed before treatment | Omitting it creates a controlled demonstration of unmeasured-confounding bias. |
| post_outcome_proxy | Post-treatment leakage variable | No | It is partly constructed from outcome and treatment, so adjusting for it changes the causal question. |
| treatment | Continuous treatment D | Target treatment, not a control | The effect of this variable on outcome is the target of estimation. |
| outcome | Outcome Y | No | This is the response variable, not a feature for the nuisance models. |
The dictionary makes the coming scenarios easier to judge. Including latent_need is valid in this teaching dataset because it is pre-treatment. Including post_outcome_proxy is not valid even though it will be highly predictive.
Design Diagram
The diagram summarizes the first simulation. Solid arrows show the intended causal and confounding structure. The dashed arrow marks the bad-control path created when a post-treatment proxy is included as if it were a baseline covariate.
The diagram is a compact reminder of the control-set rule. Valid controls are measured before treatment and help block backdoor paths. A post-treatment proxy is not made valid by being predictive.
Fit Control-Set Scenarios
Now we fit four versions of the same PLR analysis. The only thing that changes is the feature set. This isolates a central lesson: DoubleML can only adjust for the variables we give it, and giving it the wrong variables can be worse than giving it too few.
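The DoubleML fits themselves run through the `fit_plr` helper. As a transparent stand-in for the omitted-confounder half of the comparison, this numpy sketch does the same partialling out with plain OLS on a simplified linear DGP (coefficients are illustrative, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
x = rng.normal(size=(n, 3))
latent_need = rng.normal(size=n)
d = x @ np.array([0.3, 0.3, 0.3]) + 1.2 * latent_need + rng.normal(size=n)
y = 1.0 * d + x @ np.array([0.5, 0.5, 0.5]) + 1.5 * latent_need + rng.normal(size=n)

def partialling_out(y, d, controls):
    """Residualize y and d on the controls by OLS, then regress residual on residual."""
    X = np.column_stack([np.ones(len(y)), controls])
    y_res = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    d_res = d - X @ np.linalg.lstsq(X, d, rcond=None)[0]
    return float(np.sum(y_res * d_res) / np.sum(d_res**2))

with_conf = partialling_out(y, d, np.column_stack([x, latent_need]))
without_conf = partialling_out(y, d, x)
print(round(with_conf, 2), round(without_conf, 2))
```

Dropping `latent_need` from the control set shifts the estimate well away from the true value of 1.0; no amount of learner flexibility on the remaining controls can repair that.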
The valid-control model is closest to the known truth. Omitting the strong confounder pushes the estimate upward, while adding a post-treatment proxy severely distorts the answer. The table is deliberately blunt: accurate prediction and valid adjustment are different goals.
Plot Control-Set Estimates
A plot makes the control-set risk easier to scan. The vertical line marks the true effect used by the simulator, and each interval is a normal 95% confidence interval from the fitted DoubleML model.
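A minimal version of such a forest plot can be sketched as follows, assuming a results table with the `estimate` and CI columns that `fit_plr` produces; the numbers below are illustrative placeholders, not the notebook's fitted values:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

# Illustrative placeholders standing in for the fitted control_results table.
results = pd.DataFrame({
    "model": ["Valid controls", "Omitted confounder", "Bad post-treatment control"],
    "estimate": [1.01, 1.62, 0.35],
    "ci_95_lower": [0.95, 1.55, 0.31],
    "ci_95_upper": [1.07, 1.69, 0.39],
})

fig, ax = plt.subplots(figsize=(8, 4))
y_pos = list(range(len(results)))
xerr = [results["estimate"] - results["ci_95_lower"],
        results["ci_95_upper"] - results["estimate"]]
ax.errorbar(results["estimate"], y_pos, xerr=xerr, fmt="o", capsize=4)
ax.axvline(1.0, linestyle="--", color="gray", label="true theta")
ax.set_yticks(y_pos)
ax.set_yticklabels(results["model"])
ax.set_xlabel("Estimated treatment effect")
ax.legend()
fig.tight_layout()
out_file = Path("control_set_estimates.png")
fig.savefig(out_file)
```

The asymmetric `xerr` form keeps the intervals honest even when a CI is not centered on the point estimate.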
The visual gap between scenarios is the main lesson. A narrow interval around a biased estimate is still a bad causal result. Reporting only one preferred specification would hide the fragility created by the control set.
Residual And Nuisance Diagnostics
DoubleML estimates rely on nuisance models, but nuisance diagnostics should be read carefully. Low RMSE is useful; it is not proof that the adjustment set is valid. This cell compares nuisance prediction quality and residualized-treatment variation across the control-set scenarios.
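The diagnostics table itself comes from `fit_plr`; the stand-alone sketch below (a simplified linear DGP with illustrative coefficients) shows the numerical signature of leakage: adding a post-treatment proxy shrinks residual treatment variation while distorting the slope.

```python
import numpy as np

rng = np.random.default_rng(31)
n = 4000
x = rng.normal(size=n)
d = x + rng.normal(size=n)
y = 1.0 * d + 0.5 * x + rng.normal(size=n)
proxy = 0.5 * y + 0.3 * d + rng.normal(scale=0.5, size=n)  # post-treatment leakage

def partial_out(v, controls):
    """OLS-residualize v on an intercept plus the given control columns."""
    X = np.column_stack([np.ones(n)] + controls)
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

results = {}
for label, controls in [("pre-treatment only", [x]), ("with leakage proxy", [x, proxy])]:
    d_res = partial_out(d, controls)
    y_res = partial_out(y, controls)
    theta = float(np.sum(y_res * d_res) / np.sum(d_res**2))
    results[label] = {"resid_sd": float(np.std(d_res)), "theta": theta}
print(results)
```

Note that the leakage model may well report a lower outcome RMSE at the same time, which is exactly why prediction metrics cannot arbitrate control-set validity.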
The bad post-treatment control can look numerically attractive because it predicts outcome-related variation. That is exactly why timing metadata matters. A diagnostic can tell us how a learner behaved; it cannot decide whether a feature belongs in the causal design.
Weak Residual Treatment Variation
For a continuous treatment, overlap shows up as residual treatment variation after adjusting for controls. If the treatment is almost deterministic given the controls, then \(\hat{v}_i = D_i - \hat{m}(X_i)\) is tiny. The score denominator becomes small, and the estimate can become noisy.
This simulation changes only the treatment-noise scale. Smaller treatment noise means weaker residual variation.
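The omitted simulation cell can be approximated with this sketch, which varies only the treatment-noise scale and computes the classical standard error of the partialled-out slope (a simplified linear stand-in for the notebook's design):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 4000
x = rng.normal(size=n)

ses = {}
for noise_sd in [1.0, 0.5, 0.15]:
    d = x + rng.normal(scale=noise_sd, size=n)  # residual variation shrinks with noise_sd
    y = 1.0 * d + 0.5 * x + rng.normal(size=n)
    # Residualize on x, then compute the slope and its classical standard error.
    d_res = d - (np.cov(d, x)[0, 1] / np.var(x)) * x
    y_res = y - (np.cov(y, x)[0, 1] / np.var(x)) * x
    theta = np.sum(y_res * d_res) / np.sum(d_res**2)
    resid = y_res - theta * d_res
    ses[noise_sd] = float(np.sqrt(np.var(resid) / np.sum(d_res**2)))
print(ses)
```

The standard error scales roughly like \(1 / (\sqrt{n}\,\mathrm{sd}(\hat{v}))\), so halving the residual treatment variation roughly doubles the SE regardless of how well the nuisance models fit.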
As residual treatment variation gets smaller, the standard error grows and the estimate becomes more fragile. This is not a learner failure; it is a design problem. There is not enough treatment movement left after adjustment to estimate the slope precisely.
Plot Residual Treatment Variation
The next plot shows the distribution of residualized treatment for each noise level. This is the continuous-treatment diagnostic behind the previous table.
The narrowest curve corresponds to the weakest residual variation. A DoubleML report should include this kind of diagnostic when the treatment is continuous or when treatment assignment is highly predictable from controls.
Nuisance Learner Comparison
The next simulation is nonlinear. The goal is not to prove that one learner is always best. The goal is to show how to compare nuisance learners without mistaking predictive performance for identification. We compare a simple linear learner with a random forest on the same known-truth design.
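The learner-comparison cell is not shown in this export; a minimal sklearn sketch of the same idea compares cross-validated RMSE for a linear model versus a random forest on a hypothetical nonlinear signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(23)
n = 3000
X = rng.uniform(-2, 2, size=(n, 4))
# Nonlinear signal: a sine term plus a symmetric quadratic a linear model cannot capture.
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

def cv_rmse(model):
    """Out-of-fold RMSE from 5-fold cross-validated predictions."""
    pred = cross_val_predict(model, X, y, cv=5)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

rmse_linear = cv_rmse(LinearRegression())
rmse_forest = cv_rmse(RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0))
print(round(rmse_linear, 3), round(rmse_forest, 3))
```

Out-of-fold predictions matter here: in-sample RMSE would flatter the forest and hide overfitting, which is the same reason DoubleML insists on cross-fitting for the nuisance functions.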
The random forest improves nuisance prediction in this nonlinear design. The causal estimate is still judged against the known truth, not against RMSE alone. In real data, we do not know the true effect, so this table becomes evidence about numerical plausibility rather than final validation.
Plot Learner Diagnostics
This plot puts the nuisance-model comparison into one view: treatment-effect error on one axis and nuisance RMSE on the other. It is a compact way to discuss whether better nuisance prediction also produced a more stable target estimate.
The two panels should be read together. Strong nuisance models are helpful, but the report should not claim that a lower prediction error proves the causal estimate. Prediction diagnostics support the estimation story; they do not replace the assumptions.
Sample-Split Sensitivity
Cross-fitting uses fold assignments. A single random split is usually not the whole story. This cell refits the same valid-control design over multiple fixed fold seeds and records how much the estimate moves.
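A stripped-down version of this check, assuming linear nuisance learners and a simple linear DGP, cross-fits over several deterministic KFold seeds and summarizes the spread:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(29)
n = 4000
X = rng.normal(size=(n, 5))
d = X @ np.full(5, 0.4) + rng.normal(size=n)
y = 1.0 * d + X @ np.full(5, 0.5) + rng.normal(size=n)

def crossfit_theta(seed):
    """Cross-fit the two nuisance regressions, then solve the partialling-out score."""
    y_hat = np.zeros(n)
    d_hat = np.zeros(n)
    for train, test in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        y_hat[test] = LinearRegression().fit(X[train], y[train]).predict(X[test])
        d_hat[test] = LinearRegression().fit(X[train], d[train]).predict(X[test])
    y_res, d_res = y - y_hat, d - d_hat
    return float(np.sum(y_res * d_res) / np.sum(d_res**2))

estimates = [crossfit_theta(seed) for seed in range(10)]
print(round(float(np.std(estimates)), 4))
```

Only the fold assignment changes across seeds, so the spread of `estimates` isolates split luck from sampling noise; the notebook's `make_kfold_splits` helper plays the same role for the DoubleML fits.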
The split summary tells us whether one fold assignment was unusually lucky or unlucky. In this clean simulation, the repeated estimates should be fairly stable. In noisier real applications, this can be one of the most useful robustness checks.
Plot Sample-Split Stability
The next plot shows each repeated estimate and the known true effect. This is a simple visual habit that makes cross-fitting randomness visible.
The estimates cluster tightly in this example. If the dots jumped across substantively different values, the report should say so and either use repeated cross-fitting or explain why the design is unstable.
Diagnostic Scorecard
A good DoubleML report should not bury diagnostics in scattered notebook cells. This scorecard converts the previous examples into a compact review table. The goal is to make it easy to see which risks were checked, what evidence was produced, and what remains unresolved.
valid_row = control_results.loc[control_results["model"].eq("Valid controls including latent_need")].iloc[0]
omitted_row = control_results.loc[control_results["model"].eq("Omitted latent_need")].iloc[0]
bad_control_row = control_results.loc[control_results["model"].eq("Bad post-treatment control")].iloc[0]
weak_overlap_row = overlap_results.loc[overlap_results["treatment_noise_sd"].eq(0.15)].iloc[0]
scorecard = pd.DataFrame(
    [
        {
            "risk_area": "Confounding",
            "evidence_from_notebook": f"Omitting latent_need changed estimate from {valid_row['estimate']:.3f} to {omitted_row['estimate']:.3f}.",
            "status": "Requires design justification",
            "reporting_action": "State adjustment set and discuss unavailable confounders.",
        },
        {
            "risk_area": "Bad controls",
            "evidence_from_notebook": f"Adding post_outcome_proxy moved estimate to {bad_control_row['estimate']:.3f}.",
            "status": "High risk if included",
            "reporting_action": "Classify variables by measurement time and exclude post-treatment controls.",
        },
        {
            "risk_area": "Residual treatment variation",
            "evidence_from_notebook": f"Weakest scenario residual SD is {weak_overlap_row['resid_treatment_sd']:.3f} with SE {weak_overlap_row['std_error']:.3f}.",
            "status": "Check required",
            "reporting_action": "Show residualized-treatment distribution or equivalent overlap diagnostic.",
        },
        {
            "risk_area": "Nuisance quality",
            "evidence_from_notebook": "Linear and forest learners were compared on a nonlinear design.",
            "status": "Model-dependent",
            "reporting_action": "Report learner settings, nuisance RMSE, and tuning choices.",
        },
        {
            "risk_area": "Sample splitting",
            "evidence_from_notebook": f"Repeated split SD is {split_summary.loc[0, 'sd_estimate']:.4f}.",
            "status": "Stable in this simulation",
            "reporting_action": "Report fold count, repetition count, and random seeds.",
        },
        {
            "risk_area": "Communication",
            "evidence_from_notebook": "Tables and figures are saved as reviewable artifacts.",
            "status": "Report explicitly",
            "reporting_action": "Include assumptions, diagnostics, limitations, and artifact paths.",
        },
    ]
)
save_table(scorecard, f"{NOTEBOOK_PREFIX}_diagnostic_scorecard.csv")
display(scorecard)
| risk_area | evidence_from_notebook | status | reporting_action |
|---|---|---|---|
| Confounding | Omitting latent_need changed estimate from 1.0… | Requires design justification | State adjustment set and discuss unavailable confounders. |
| Bad controls | Adding post_outcome_proxy moved estimate to 0.… | High risk if included | Classify variables by measurement time and exclude post-treatment controls. |
| Residual treatment variation | Weakest scenario residual SD is 0.153 with SE … | Check required | Show residualized-treatment distribution or equivalent overlap diagnostic. |
| Nuisance quality | Linear and forest learners were compared on a nonlinear design. | Model-dependent | Report learner settings, nuisance RMSE, and tuning choices. |
| Sample splitting | Repeated split SD is 0.0017. | Stable in this simulation | Report fold count, repetition count, and random seeds. |
| Communication | Tables and figures are saved as reviewable artifacts. | Report explicitly | Include assumptions, diagnostics, limitations, and artifact paths. |
The scorecard is intentionally plain. A reader should be able to understand the main risks without rerunning the entire notebook. The most important unresolved item in real data is usually confounding that was not measured.
Plot Diagnostic Scorecard
A small heatmap-style display can help summarize which risks are resolved, checked, or still dependent on assumptions. This is not a statistical test. It is a communication device for review.
The heatmap keeps the report honest: some items can be checked numerically, while others remain design assumptions. The red cells are not failures by themselves; they are places where a serious report needs careful language.
Reporting Template
This cell writes a reusable Markdown report template. It is deliberately structured around the things that usually go missing: target estimand, feature timing, learner settings, split design, diagnostics, sensitivity checks, and limitations.
report_template = f"""# DoubleML Diagnostics And Reporting Template

## 1. Causal Question
State the treatment, outcome, population, and target estimand. For PLR, write whether the target is a constant marginal effect like $\\theta_0$.

## 2. Identification Assumptions
Describe the adjustment set, timing of all controls, unconfoundedness assumptions, overlap or residual treatment variation, and any reasons these assumptions may fail.

## 3. Data And Feature Timing
List every feature group and classify it as pre-treatment, treatment, mediator, outcome, post-treatment, or future information. Exclude post-treatment controls from the main design.

## 4. DoubleML Specification
Report the DoubleML class, score, learners for each nuisance function, fold count, repeated-split count, random seeds, and whether sample splitting was supplied externally.

## 5. Main Estimate
Report estimate, standard error, confidence interval, and sample size. Explain the units of treatment and outcome.

## 6. Diagnostics
Include nuisance RMSE, residualized-treatment variation, score denominator, sample-split stability, and any learner-comparison results.

## 7. Robustness And Sensitivity
Discuss omitted-confounder risk, alternative adjustment sets, weak-overlap checks, and whether the estimate changes under defensible specifications.

## 8. Limitations
State what the analysis cannot prove. DoubleML does not repair missing confounders, bad controls, interference, measurement error, or target mismatch.

## 9. Artifact Paths From This Notebook
- Pitfall map: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_pitfall_map.csv'}
- Control-set scenarios: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_control_set_scenarios.csv'}
- Residual treatment variation: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_residual_treatment_variation.csv'}
- Nuisance learner comparison: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_nuisance_learner_comparison.csv'}
- Sample-split summary: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_sample_split_summary.csv'}
- Diagnostic scorecard: {TABLE_DIR / f'{NOTEBOOK_PREFIX}_diagnostic_scorecard.csv'}
"""
# Note: \\theta_0 is double-escaped because \t inside an f-string would otherwise be a tab.
report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_diagnostics_reporting_template.md"
report_path.write_text(report_template)
print(f"Wrote report template to: {report_path}")
Wrote report template to: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/17_diagnostics_reporting_template.md
The template is meant to be copied forward into real analyses. It also makes the tutorial concrete: every major claim made by the notebook has a corresponding artifact path.
Artifact Manifest
The final table lists the datasets, tables, figures, and report generated by this notebook. This is useful when reviewing the tutorial later or when turning it into a portfolio-style writeup.
The manifest closes the loop between notebook analysis and reviewable outputs. A strong causal notebook should leave behind more than printed estimates; it should leave behind enough structure for another person to audit the design.
What Comes Next
The next tutorial is the end-to-end DoubleML case study. That notebook should combine the lessons from this series: clear causal question, correct data backend, suitable estimator class, deliberate learners, cross-fitting diagnostics, uncertainty, sensitivity, pitfalls, and a concise final report.
The main lesson from this notebook is simple: DoubleML is not magic. It is a disciplined way to estimate orthogonal scores after you have done the causal design work. The best analysts do not just produce an estimate; they show why that estimate deserves attention and where it could still fail.