DoubleML Tutorial 01: DML Theory, Orthogonalization, And Cross-Fitting
This notebook explains the theory that makes DoubleML useful. The package is easiest to trust when the mechanics are not mysterious: nuisance functions are estimated with machine learning, the target causal parameter is estimated through an orthogonal score, and cross-fitting keeps nuisance prediction honest.
The running example is a partially linear regression design:
\[
Y = \theta_0 D + g_0(X) + \varepsilon,
\]
where Y is the outcome, D is a treatment or exposure, X is a vector of controls, g_0(X) is an unknown outcome nuisance function, and theta_0 is the causal parameter we want. Treatment assignment is also modeled as
\[
D = m_0(X) + V.
\]
DoubleML estimates g_0(X) and m_0(X) flexibly, but it does not treat those prediction models as the final goal. The final goal is the low-dimensional causal parameter theta_0 and its uncertainty.
Estimated runtime: about 1 minute on a typical laptop.
Learning Goals
By the end of this notebook, you should be able to:
explain why naive treatment-outcome regression can be biased under confounding;
explain why plugging machine-learning predictions into causal estimators can create regularization bias;
derive the partially linear orthogonal score at an intuitive level;
describe why Neyman orthogonality reduces sensitivity to nuisance-model mistakes;
implement manual cross-fitting for a PLR score;
distinguish DML1-style fold averaging from DML2-style pooled score solving;
connect the manual calculations to DoubleMLPLR.
Tutorial Flow
The notebook proceeds in seven steps:
set up the environment and output folders;
simulate a high-dimensional confounded PLR dataset with known truth;
compare naive and adjusted baselines;
introduce orthogonal residual scores;
demonstrate nuisance perturbation robustness;
implement sample splitting and cross-fitting manually;
compare manual DML1, manual DML2, and DoubleMLPLR.
Setup
This cell imports the packages used in the notebook, prepares output directories, and suppresses known non-substantive notebook warnings. We set MPLCONFIGDIR before importing plotting libraries so Matplotlib cache files stay inside the project outputs folder.
The setup confirms the environment and creates a stable place for saved artifacts. This notebook uses LassoCV for the main nuisance learners because the synthetic design below is sparse and high-dimensional.
Version Table
Theory notebooks still benefit from version logging. The exact learner behavior, defaults, and output formatting can change across package versions.
from importlib import metadata

packages = ["doubleml", "numpy", "pandas", "scikit-learn", "matplotlib", "seaborn"]
version_table = []
for package in packages:
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    version_table.append({"package": package, "version": version})

version_table = pd.DataFrame(version_table)
version_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(version_table)
package        version
doubleml       0.11.2
numpy          2.4.4
pandas         3.0.2
scikit-learn   1.6.1
matplotlib     3.10.9
seaborn        0.13.2
This small table gives the notebook a reproducibility anchor. If a future run changes slightly, the package versions are one of the first things to check.
Theory Map
Before coding, we summarize the main theory concepts. Each row in the table will appear again in executable form later in the notebook.
theory_map = pd.DataFrame(
    [
        {
            "concept": "confounding",
            "plain_language_role": "Controls affect both treatment and outcome, so naive treatment-outcome association is not causal.",
            "where_it_appears_below": "naive baseline estimate",
        },
        {
            "concept": "nuisance function",
            "plain_language_role": "A helper prediction function such as E[Y | X] or E[D | X].",
            "where_it_appears_below": "Lasso nuisance models",
        },
        {
            "concept": "regularization bias",
            "plain_language_role": "Bias caused when regularized nuisance prediction errors leak into the causal estimate.",
            "where_it_appears_below": "in-sample and memorizing learner demos",
        },
        {
            "concept": "orthogonal score",
            "plain_language_role": "A score designed so small nuisance mistakes have reduced first-order impact on theta.",
            "where_it_appears_below": "residualized PLR score",
        },
        {
            "concept": "cross-fitting",
            "plain_language_role": "Train nuisance functions on one fold and score held-out observations on another fold.",
            "where_it_appears_below": "manual K-fold DML implementation",
        },
        {
            "concept": "DML1 versus DML2",
            "plain_language_role": "DML1 averages fold-specific estimates; DML2 solves one pooled score after cross-fitting.",
            "where_it_appears_below": "manual fold table and pooled estimate",
        },
    ]
)
theory_map.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_theory_map.csv", index=False)
display(theory_map)
confounding: Controls affect both treatment and outcome, so naive treatment-outcome association is not causal. (appears below: naive baseline estimate)
nuisance function: A helper prediction function such as E[Y | X] or E[D | X]. (appears below: Lasso nuisance models)
regularization bias: Bias caused when regularized nuisance prediction errors leak into the causal estimate. (appears below: in-sample and memorizing learner demos)
orthogonal score: A score designed so small nuisance mistakes have reduced first-order impact on theta. (appears below: residualized PLR score)
cross-fitting: Train nuisance functions on one fold and score held-out observations on another fold. (appears below: manual K-fold DML implementation)
DML1 versus DML2: DML1 averages fold-specific estimates; DML2 solves one pooled score after cross-fitting. (appears below: manual fold table and pooled estimate)
The common thread is separation. DoubleML separates causal target estimation from nuisance prediction, and cross-fitting separates nuisance training rows from score-evaluation rows.
Simulating A Confounded PLR Dataset
We use a synthetic dataset because it lets us know the true treatment effect. The controls X affect both treatment D and outcome Y, so a naive regression of Y on D will not recover the true effect.
The design is sparse and high-dimensional: there are many controls, but only the first few matter. That gives us a natural setting for regularized nuisance learners.
The observed data has the columns an analyst would see: outcome, treatment, and controls. The oracle nuisance columns are saved separately only for teaching; real datasets do not include true nuisance functions.
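The data-generating process described above can be sketched in a few lines. The dimensions, coefficients, seed, and the true effect `THETA_TRUE` below are illustrative assumptions for a toy draw, not the notebook's exact design:

```python
import numpy as np
import pandas as pd

# Sketch of a sparse, confounded PLR data-generating process.
# All numbers here are illustrative stand-ins.
rng = np.random.default_rng(42)
n, p = 500, 50
THETA_TRUE = 0.5

x = rng.normal(size=(n, p))
m0 = 1.0 * x[:, 0] - 0.8 * x[:, 1]            # true treatment nuisance, sparse in x
g0 = 1.2 * x[:, 0] + 0.9 * x[:, 2]            # true outcome nuisance, sparse in x
d = m0 + rng.normal(size=n)                   # treatment depends on confounders
y = THETA_TRUE * d + g0 + rng.normal(size=n)  # outcome depends on d and confounders

data = pd.DataFrame(x, columns=[f"x{j + 1}" for j in range(p)])
data["d"] = d
data["y"] = y
print(data.shape)
```

Only the first few of the fifty controls carry signal, which is what makes regularized nuisance learners a natural fit.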
The next cell documents the variable roles. This matters because DoubleML uses column roles to define the score. A role mistake is not a small syntax issue; it changes the estimand.
Only y, d, and the x columns will be passed to DoubleML. The oracle columns are useful for the theory demonstrations but would be unavailable in an applied analysis.
Why Naive Regression Fails
The simplest mistake is to regress outcome on treatment while ignoring controls. Because the controls affect both treatment and outcome, the treatment coefficient absorbs part of the control effect.
We compare three estimates:
naive regression of y on d only;
full linear regression of y on d and all controls;
oracle residual score using the true nuisance functions, available only because this is synthetic data.
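These three baselines can be sketched on a toy one-control draw; the data, coefficients, and seed below are illustrative, not the notebook's simulation:

```python
import numpy as np

# Toy confounded draw: true effect is 0.5, but x drives both d and y.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
d = x + rng.normal(size=n)
y = 0.5 * d + 2.0 * x + rng.normal(size=n)

# Naive: regress y on d only, absorbing the confounder effect.
theta_naive = np.polyfit(d, y, 1)[0]

# Full linear adjustment: y on d and x jointly.
X_full = np.column_stack([d, x, np.ones(n)])
theta_full = np.linalg.lstsq(X_full, y, rcond=None)[0][0]

# Oracle residual score: residualize with the true nuisances
# m0(x) = x and l0(x) = E[y | x] = 2.5 * x, then regress residuals.
rd, ry = d - x, y - 2.5 * x
theta_oracle = np.mean(rd * ry) / np.mean(rd ** 2)

print(round(theta_naive, 2), round(theta_full, 2), round(theta_oracle, 2))
```

The naive slope lands well above the true 0.5, while the adjusted and oracle estimates recover it.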
The naive estimate is far from the true effect because it ignores confounding. The oracle residual score is close because it removes the true control-driven parts of outcome and treatment before estimating the treatment effect.
This plot shows the same comparison visually. The vertical dashed line is the true effect, available only in this controlled teaching example.
The visual makes the confounding problem concrete. DoubleML is designed for the middle ground where the oracle functions are unknown, but flexible learners can estimate useful nuisance functions from controls.
The Orthogonal PLR Score
The partialling-out score for PLR can be written as
\[
\psi(W; \theta, \eta) = \big(Y - \ell_0(X) - \theta\,(D - m_0(X))\big)\,\big(D - m_0(X)\big),
\]
where m_0(X) = E[D | X] is the treatment nuisance and l_0(X) = E[Y | X] = theta_0 m_0(X) + g_0(X) is the outcome nuisance. The estimate solves the sample average of this score set to zero.
The key is not just residualization. The key is that the score is orthogonal: near the truth, small errors in l or m have reduced first-order impact on theta.
The next table turns the score into a set of operational pieces. These are the pieces that DoubleML automates internally.
score_pieces = pd.DataFrame(
    [
        {
            "piece": "Y - l_hat(X)",
            "role": "outcome residual",
            "why_it_matters": "Removes the part of the outcome explained by controls.",
        },
        {
            "piece": "D - m_hat(X)",
            "role": "treatment residual",
            "why_it_matters": "Removes the part of treatment assignment explained by controls.",
        },
        {
            "piece": "mean(treatment_residual * outcome_residual)",
            "role": "score numerator",
            "why_it_matters": "Measures remaining treatment-outcome movement after residualization.",
        },
        {
            "piece": "mean(treatment_residual ** 2)",
            "role": "score denominator",
            "why_it_matters": "Measures remaining treatment variation after adjustment.",
        },
        {
            "piece": "numerator / denominator",
            "role": "theta estimate",
            "why_it_matters": "Solves the sample analog of the orthogonal moment condition.",
        },
    ]
)
score_pieces.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_score_pieces.csv", index=False)
display(score_pieces)
Y - l_hat(X): outcome residual. Removes the part of the outcome explained by controls.
D - m_hat(X): treatment residual. Removes the part of treatment assignment explained by controls.
mean(treatment_residual * outcome_residual): score numerator. Measures remaining treatment-outcome movement after residualization.
mean(treatment_residual ** 2): score denominator. Measures remaining treatment variation after adjustment.
numerator / denominator: theta estimate. Solves the sample analog of the orthogonal moment condition.
This table is the simplest mental model for PLR DML: estimate nuisance functions, residualize outcome and treatment, then regress the residualized outcome on the residualized treatment.
Orthogonality By Perturbation
Orthogonality can feel abstract, so we demonstrate it numerically. We start from the true nuisance functions and add controlled perturbations. Then we compare:
the orthogonal residual score, which residualizes both outcome and treatment;
a non-orthogonal residual score, which residualizes the outcome but does not residualize the treatment.
The orthogonal score should move more slowly as the nuisance perturbation grows.
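A numerical sketch of that comparison follows. The data and perturbation scheme are illustrative; the non-orthogonal score here residualizes the outcome by the g-type nuisance but leaves the treatment raw, so both scores are correct at zero perturbation:

```python
import numpy as np

# Toy draw with known nuisances: m0(x) = x, g0(x) = 2x, l0(x) = E[y|x] = 2.5x.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
d = x + rng.normal(size=n)
y = 0.5 * d + 2.0 * x + rng.normal(size=n)
g0, l0, m0 = 2.0 * x, 2.5 * x, x

for eps in [0.0, 0.1, 0.2]:
    # Perturb each nuisance by eps * x and re-solve both scores.
    rd = d - (m0 + eps * x)
    ry = y - (l0 + eps * x)
    theta_orth = np.mean(rd * ry) / np.mean(rd ** 2)      # residualizes both

    ry_g = y - (g0 + eps * x)
    theta_nonorth = np.mean(d * ry_g) / np.mean(d ** 2)   # residualizes y only

    print(eps, round(theta_orth, 3), round(theta_nonorth, 3))
```

The orthogonal estimate drifts only at second order in eps, while the non-orthogonal estimate picks up a first-order bias.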
The orthogonal estimate changes more slowly near zero perturbation. The point is not that nuisance errors do not matter. They do. The point is that the score is designed to reduce first-order sensitivity around the correct nuisance functions.
The next plot shows the perturbation result. A flatter curve near zero is the numerical fingerprint of the orthogonal score.
The less protected score reacts more sharply to the same nuisance perturbation. This is why orthogonal scores are central to DoubleML.
Regularization Bias And Memorizing Learners
Orthogonality helps with nuisance errors, but it does not license careless prediction. If a nuisance learner memorizes the training data and we evaluate it on the same rows, residuals can become artificially tiny. That can make the residual score unstable or meaningless.
This cell compares in-sample and cross-fitted residualization using a deliberately memorizing 1-nearest-neighbor learner.
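A minimal version of that comparison, on illustrative data rather than the notebook's exact design, might look like:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict

# Illustrative confounded draw; the same collapse happens for the outcome.
rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=(n, 5))
d = x[:, 0] + rng.normal(size=n)

# In-sample 1-NN: each row's nearest neighbor is itself, so it memorizes d.
in_sample_m = KNeighborsRegressor(n_neighbors=1).fit(x, d).predict(x)

# Cross-fitted 1-NN: each row is predicted by a model that never saw it.
cross_fit_m = cross_val_predict(KNeighborsRegressor(n_neighbors=1), x, d, cv=5)

print(np.var(d - in_sample_m))   # essentially zero residual variation
print(np.var(d - cross_fit_m))   # real held-out residual variation
```

With zero in-sample treatment residuals, the score denominator mean(treatment_residual ** 2) vanishes and the residual score breaks down; cross-fitting restores genuine residual variation.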
The in-sample 1-nearest-neighbor learner nearly memorizes the outcome and treatment, leaving almost no residual treatment variation. Cross-fitting prevents a row from supplying its own nuisance prediction, so the residual score becomes well defined again, although this learner is still a poor choice here.
Cross-Fitting With A Sensible Sparse Learner
Now we use LassoCV, which matches the sparse linear structure of the synthetic data. We compute nuisance predictions both in-sample and out-of-fold so the difference is explicit.
The cross-fitted estimate is based on out-of-fold predictions. Its nuisance R^2 values are lower than the in-sample values because held-out prediction is harder, but that honesty is exactly the point.
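A compact sketch of cross-fitted LassoCV residualization follows; the dataset is an illustrative stand-in with the same sparse flavor, and `cross_val_predict` is one convenient way to obtain out-of-fold predictions:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

# Illustrative sparse, high-dimensional PLR draw (true effect 0.5).
rng = np.random.default_rng(3)
n, p = 500, 50
x = rng.normal(size=(n, p))
d = x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=n)
y = 0.5 * d + 1.2 * x[:, 0] + 0.9 * x[:, 2] + rng.normal(size=n)

# Out-of-fold nuisance predictions: each row is scored by a model
# trained on folds that exclude it.
cf_l = cross_val_predict(LassoCV(cv=5), x, y, cv=5)
cf_m = cross_val_predict(LassoCV(cv=5), x, d, cv=5)

# Residualize both variables, then solve the orthogonal score.
ry, rd = y - cf_l, d - cf_m
theta_cf = np.mean(rd * ry) / np.mean(rd ** 2)
print(round(theta_cf, 3))
```

The out-of-fold residuals are noisier than in-sample residuals would be, but they are the honest inputs the orthogonal score needs.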
The next figure shows observed versus cross-fitted nuisance predictions. We are not trying to maximize predictive performance at all costs; we are checking that the nuisance learners capture meaningful control-driven signal.
fig, axes = plt.subplots(1, 2, figsize=(12, 4.8))
sns.scatterplot(x=cf_l_lasso, y=y_array, s=22, alpha=0.55, color="#2563eb", ax=axes[0])
axes[0].set_title("Outcome Nuisance: Cross-Fitted Prediction")
axes[0].set_xlabel("Predicted y from x")
axes[0].set_ylabel("Observed y")
sns.scatterplot(x=cf_m_lasso, y=d_array, s=22, alpha=0.55, color="#16a34a", ax=axes[1])
axes[1].set_title("Treatment Nuisance: Cross-Fitted Prediction")
axes[1].set_xlabel("Predicted d from x")
axes[1].set_ylabel("Observed d")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_cross_fitted_lasso_nuisance_predictions.png", dpi=160, bbox_inches="tight")
plt.show()
Both nuisance predictions contain signal. The treatment nuisance is particularly important because residual treatment variation is what identifies the partially linear effect after adjustment.
Sample Splitting Mechanics
Cross-fitting uses repeated train and held-out roles. This cell records the fold sizes used in the manual calculations above. Every observation appears in exactly one held-out fold for this single split.
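The bookkeeping can be verified in a few lines; the 5-fold split, sample size, and seed here are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

# Track how often each observation appears in a held-out fold.
n = 500
held_out_counts = np.zeros(n, dtype=int)
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for fold_id, (train_idx, test_idx) in enumerate(splitter.split(np.arange(n))):
    held_out_counts[test_idx] += 1
    print(f"fold {fold_id}: train={len(train_idx)}, held out={len(test_idx)}")

# For a single split, every row is held out exactly once.
print(held_out_counts.min(), held_out_counts.max())
```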
The dark cells move across folds so each observation gets one held-out nuisance prediction. That held-out prediction is what enters the orthogonal score.
DML1 Versus DML2
DML1 and DML2 use the same basic score but aggregate folds differently.
DML1: solve the score separately in each fold, then average the fold-specific estimates.
DML2: stack all cross-fitted residuals and solve one pooled score.
Many modern DoubleML workflows report a DML2-style pooled estimate. We compute both manually here because the distinction helps clarify what cross-fitting is doing.
The fold estimates vary because each held-out fold has its own residual distribution. DML1 averages those fold estimates; DML2 solves one pooled score across all held-out predictions.
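The two aggregation rules can be sketched side by side; the data, learner settings, and seeds below are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Illustrative sparse PLR draw (true effect 0.5).
rng = np.random.default_rng(4)
n, p = 500, 50
x = rng.normal(size=(n, p))
d = x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=n)
y = 0.5 * d + 1.2 * x[:, 0] + 0.9 * x[:, 2] + rng.normal(size=n)

ry = np.zeros(n)
rd = np.zeros(n)
fold_thetas = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    # Fit nuisances on the training folds, residualize the held-out fold.
    l_hat = LassoCV(cv=5).fit(x[train_idx], y[train_idx]).predict(x[test_idx])
    m_hat = LassoCV(cv=5).fit(x[train_idx], d[train_idx]).predict(x[test_idx])
    ry[test_idx] = y[test_idx] - l_hat
    rd[test_idx] = d[test_idx] - m_hat
    # DML1: solve the score inside this held-out fold only.
    fold_thetas.append(np.mean(rd[test_idx] * ry[test_idx]) / np.mean(rd[test_idx] ** 2))

theta_dml1 = np.mean(fold_thetas)                   # average of fold estimates
theta_dml2 = np.mean(rd * ry) / np.mean(rd ** 2)    # one pooled score
print(round(theta_dml1, 3), round(theta_dml2, 3))
```

With folds of equal size and a well-behaved denominator, the two estimates typically land close together; they differ because DML1 lets each fold's residual variance weight its own estimate.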
The next plot shows fold-level variation and compares it with the pooled DML2 estimate and the true synthetic effect.
Fold variability is a useful reminder that sample splitting adds randomness. Later notebooks revisit repeated cross-fitting and split sensitivity more carefully.
Connecting Manual DML To DoubleMLPLR
Now we fit DoubleMLPLR using the same observed dataset and a Lasso nuisance learner. The package handles sample splitting, nuisance fitting, score solving, standard errors, and confidence intervals.
The DoubleML estimate is close to the manual cross-fitted residual estimate because both are solving the same PLR score idea. DoubleML additionally reports standard errors, p-values, and confidence intervals.
We collect all major estimates in one comparison table. This is a useful reporting habit because it separates confounding bias, adjustment behavior, and DML behavior.
The table shows the main lesson of the notebook: the naive estimate is not a causal estimate in this design, while the residualized and cross-fitted estimates aim at the true treatment effect.
The comparison plot places all estimates against the true synthetic effect. This type of plot is helpful in simulations because it shows which procedures are targeting the right quantity.
The cross-fitted DML estimates cluster near the true value. The naive estimate remains far away because it does not address the confounding built into the simulation.
Inspecting Score Elements
DoubleML stores score-related arrays after fitting. These are advanced internal outputs, but seeing their shapes helps connect the package object to the theory. For one treatment and one repeated split, the score arrays have an observation dimension plus treatment and repetition dimensions.
score_diagnostics = pd.DataFrame(
    [
        {
            "object": "psi",
            "shape": str(np.asarray(plr_model.psi).shape),
            "description": "Score values evaluated at the fitted theta.",
        },
        {
            "object": "psi_deriv",
            "shape": str(np.asarray(plr_model.psi_deriv).shape),
            "description": "Score derivative values used for standard errors.",
        },
    ]
)
score_diagnostics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_score_diagnostics.csv", index=False)
display(score_diagnostics)
object     shape        description
psi        (500, 1, 1)  Score values evaluated at the fitted theta.
psi_deriv  (500, 1, 1)  Score derivative values used for standard errors.
Most users do not need to work directly with these arrays, but they are a useful bridge between the mathematical score and the fitted DoubleML object.
Nuisance Losses And Prediction Quality
The final theory point is that nuisance quality matters, even with orthogonal scores. Orthogonality reduces first-order sensitivity, but very poor nuisance models can still hurt finite-sample performance.
DoubleML reports nuisance losses. We also compute simple out-of-fold prediction quality from the stored nuisance predictions.
The loss table gives a quick check of nuisance fit. These losses should be read as diagnostics, not as the causal estimand.
This cell reads DoubleML’s stored nuisance predictions and computes simple RMSE and R2 diagnostics for each nuisance role.
doubleml_pred_l = np.asarray(plr_model.predictions["ml_l"]).squeeze()
doubleml_pred_m = np.asarray(plr_model.predictions["ml_m"]).squeeze()
prediction_quality = pd.DataFrame(
    [
        {
            "nuisance_role": "ml_l predicts y from x",
            "rmse": mean_squared_error(y_array, doubleml_pred_l) ** 0.5,
            "r2": r2_score(y_array, doubleml_pred_l),
        },
        {
            "nuisance_role": "ml_m predicts d from x",
            "rmse": mean_squared_error(d_array, doubleml_pred_m) ** 0.5,
            "r2": r2_score(d_array, doubleml_pred_m),
        },
    ]
)
prediction_quality.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_doubleml_nuisance_prediction_quality.csv", index=False)
display(prediction_quality.round(4))
nuisance_role            rmse    r2
ml_l predicts y from x   1.4449  0.7859
ml_m predicts d from x   1.0515  0.5640
The nuisance learners capture meaningful signal from the controls. That is what we need: not perfect prediction, but useful residualization for the target causal score.
Reporting Checklist For DML Theory
A theory notebook still needs a reporting checklist. These are the items that should appear whenever you use DoubleML in a real analysis.
reporting_checklist = pd.DataFrame(
    [
        {
            "check": "Target parameter named",
            "status_here": "theta in a PLR design",
            "why_it_matters": "The estimand must be clear before choosing learners.",
        },
        {
            "check": "Identification assumptions stated",
            "status_here": "controls are sufficient by construction in synthetic data",
            "why_it_matters": "DoubleML does not create identification by itself.",
        },
        {
            "check": "Nuisance roles documented",
            "status_here": "ml_l for outcome nuisance and ml_m for treatment nuisance",
            "why_it_matters": "Different DoubleML classes require different nuisance roles.",
        },
        {
            "check": "Cross-fitting plan documented",
            "status_here": "5-fold split with fixed random seed",
            "why_it_matters": "Resampling choices can affect finite-sample estimates.",
        },
        {
            "check": "Baselines compared",
            "status_here": "naive, linear adjusted, oracle, manual DML, DoubleML",
            "why_it_matters": "Baselines reveal what the DML workflow is correcting.",
        },
        {
            "check": "Nuisance diagnostics reported",
            "status_here": "losses and out-of-fold prediction quality saved",
            "why_it_matters": "Poor nuisance fit can still matter.",
        },
        {
            "check": "Uncertainty reported",
            "status_here": "DoubleML standard errors and confidence interval saved",
            "why_it_matters": "A causal estimate without uncertainty is incomplete.",
        },
    ]
)
reporting_checklist.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_reporting_checklist.csv", index=False)
display(reporting_checklist)
Target parameter named: theta in a PLR design. (The estimand must be clear before choosing learners.)
Identification assumptions stated: controls are sufficient by construction in synthetic data. (DoubleML does not create identification by itself.)
Nuisance roles documented: ml_l for outcome nuisance and ml_m for treatment nuisance. (Different DoubleML classes require different nuisance roles.)
Cross-fitting plan documented: 5-fold split with fixed random seed. (Resampling choices can affect finite-sample estimates.)
Baselines compared: naive, linear adjusted, oracle, manual DML, DoubleML. (Baselines reveal what the DML workflow is correcting.)
Nuisance diagnostics reported: losses and out-of-fold prediction quality saved. (Poor nuisance fit can still matter.)
Uncertainty reported: DoubleML standard errors and confidence interval saved. (A causal estimate without uncertainty is incomplete.)
The checklist separates theory, implementation, and reporting. That separation is the habit that makes DoubleML work credible.
Artifact Manifest
The final cell records every file created by the notebook. This makes the run easier to audit and keeps later notebooks organized.