DoubleML Tutorial 10: Learners, Hyperparameters, And Tuning
This notebook is about the practical part of DoubleML that often decides whether an analysis feels robust or fragile: choosing nuisance learners and tuning them without letting the causal estimate become a model-selection toy.
Double machine learning separates two jobs that are easy to blur:
The causal estimand is defined by the research design: treatment, outcome, controls, timing, identification assumptions, and the target population.
The machine-learning learners estimate nuisance functions that make the orthogonal score usable: outcome regressions, treatment regressions, propensities, instruments, selection probabilities, or other auxiliary quantities depending on the model class.
In a partially linear regression model, the target is usually written as

\[ Y = \theta_0 D + g_0(X) + \zeta, \]

with a treatment equation

\[ D = m_0(X) + V. \]

DoubleML estimates the nuisance functions \(g_0(X)\) and \(m_0(X)\), residualizes \(Y\) and \(D\), and then estimates \(\theta_0\) from an orthogonal score. The important point is that better nuisance prediction can help, but the final goal is not to win a prediction contest. The final goal is a credible estimate of \(\theta_0\) under valid identifying assumptions.
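To make the partialling-out logic concrete, here is a minimal sketch of the residual-on-residual idea using plain scikit-learn cross-fitting. The helper name `manual_plr_theta` is hypothetical and this is not DoubleML's internal implementation; DoubleML adds score-based standard errors and more careful split management on top of this intuition.

```python
# Hypothetical illustration of the PLR partialling-out idea, not DoubleML's
# internals: cross-fitted residualization followed by residual-on-residual OLS.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict


def manual_plr_theta(X, y, d, n_folds=5, seed=0):
    ml_l = RandomForestRegressor(n_estimators=200, random_state=seed)
    ml_m = RandomForestRegressor(n_estimators=200, random_state=seed)
    # Out-of-fold predictions play the role of cross-fitted nuisance estimates.
    l_hat = cross_val_predict(ml_l, X, y, cv=n_folds)
    m_hat = cross_val_predict(ml_m, X, d, cv=n_folds)
    y_res = y - l_hat
    d_res = d - m_hat
    # Solving the partialling-out score for theta is least squares of
    # outcome residuals on treatment residuals.
    return float(np.sum(d_res * y_res) / np.sum(d_res * d_res))
```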
That creates a practical tension. Flexible models such as boosted trees can capture nonlinear confounding patterns, but they also introduce many hyperparameters. Tuning can improve nuisance quality, but careless tuning can leak information from the final estimate back into model choice. This notebook teaches a workflow that is useful in real analysis:
Build a numeric DoubleML backend.
Compare sensible learner families under common sample splits.
Use pipelines where preprocessing is part of the learner.
Tune hyperparameters with DoubleML’s Optuna-based API.
Separate safe model-selection criteria from unsafe causal-estimate shopping.
Expected runtime: usually under one minute on this synthetic dataset. The Optuna section intentionally uses a small number of trials so the notebook stays tutorial-friendly.
Setup
This cell creates the same output folders used across the DoubleML tutorial series, sets a local Matplotlib cache to avoid environment warnings, and imports the packages used in the notebook. The warning filters are intentionally narrow: they suppress known package-level noise while keeping genuine modeling problems visible.
The setup confirms that the notebook is running against the local environment and that all artifacts will be written into the DoubleML tutorial output folder. Keeping outputs organized by notebook prefix makes it much easier to review the tutorial later.
Helper Functions
The helper functions below keep the notebook focused on causal reasoning instead of repetitive file and formatting code. There are three especially important helpers:
make_plr_data() creates a DoubleML-ready backend for a given dataframe.
fit_plr_with_learner() fits a PLR model with common cross-fitting logic and returns both the fitted object and a compact summary row.
rmse_metric() evaluates nuisance predictions from DoubleML’s stored cross-fitted predictions.
The metric function is written defensively because some DoubleML model classes can contain missing nuisance targets. PLR does not usually create missing targets, but keeping the helper general makes the pattern reusable.
```python
def save_table(df, filename):
    path = TABLE_DIR / filename
    df.to_csv(path, index=False)
    return path


def save_dataset(df, filename):
    path = DATASET_DIR / filename
    df.to_csv(path, index=False)
    return path


def rmse_metric(y_true, y_pred):
    mask = ~np.isnan(y_true)
    return mean_squared_error(y_true[mask], y_pred[mask]) ** 0.5


def mae_metric(y_true, y_pred):
    mask = ~np.isnan(y_true)
    return mean_absolute_error(y_true[mask], y_pred[mask])


def model_x_cols(df):
    excluded = {"unit_id", "outcome", "treatment", "true_g", "true_m"}
    return [col for col in df.columns if col not in excluded]


def make_plr_data(df):
    return DoubleMLData(df, y_col="outcome", d_cols="treatment", x_cols=model_x_cols(df))


def make_common_splits(df, n_folds=5, seed=RANDOM_SEED):
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return list(splitter.split(df))


def fit_plr_with_learner(df, learner_name, ml_l, ml_m=None, n_folds=5, seed=RANDOM_SEED):
    if ml_m is None:
        ml_m = ml_l
    dml_data = make_plr_data(df)
    sample_splits = make_common_splits(df, n_folds=n_folds, seed=seed)
    plr = DoubleMLPLR(
        dml_data,
        ml_l=clone(ml_l),
        ml_m=clone(ml_m),
        n_folds=n_folds,
        draw_sample_splitting=False,
    )
    plr.set_sample_splitting(sample_splits)
    start = time.perf_counter()
    plr.fit()
    runtime_seconds = time.perf_counter() - start
    learner_rmse = plr.evaluate_learners(metric=rmse_metric)
    learner_mae = plr.evaluate_learners(metric=mae_metric)
    row = {
        "learner": learner_name,
        "theta_hat": float(plr.coef[0]),
        "se": float(plr.se[0]),
        "ci_95_lower": float(plr.confint().iloc[0, 0]),
        "ci_95_upper": float(plr.confint().iloc[0, 1]),
        "bias_vs_true": float(plr.coef[0] - TRUE_THETA),
        "abs_bias_vs_true": float(abs(plr.coef[0] - TRUE_THETA)),
        "rmse_ml_l": float(learner_rmse["ml_l"][0, 0]),
        "rmse_ml_m": float(learner_rmse["ml_m"][0, 0]),
        "mae_ml_l": float(learner_mae["ml_l"][0, 0]),
        "mae_ml_m": float(learner_mae["ml_m"][0, 0]),
        "avg_nuisance_rmse": float((learner_rmse["ml_l"][0, 0] + learner_rmse["ml_m"][0, 0]) / 2),
        "runtime_seconds": runtime_seconds,
    }
    return plr, row


def learner_family_table():
    return pd.DataFrame(
        [
            {
                "family": "Penalized linear models",
                "examples": "RidgeCV, LassoCV",
                "strength": "Fast, stable, strong when nuisance functions are close to linear after feature engineering.",
                "watch_out": "Can underfit nonlinear confounding and make residualization too crude.",
            },
            {
                "family": "Random forests",
                "examples": "RandomForestRegressor",
                "strength": "Good default for nonlinearities and interactions with limited tuning.",
                "watch_out": "Can be slower and less smooth; shallow/deep choices matter.",
            },
            {
                "family": "Histogram gradient boosting",
                "examples": "HistGradientBoostingRegressor",
                "strength": "Strong sklearn-native boosted tree option with good speed.",
                "watch_out": "Needs regularization through depth, leaves, learning rate, and minimum leaf size.",
            },
            {
                "family": "External boosted trees",
                "examples": "LightGBM, XGBoost",
                "strength": "Flexible, fast, and common in applied data science workflows.",
                "watch_out": "Large search spaces make tuning discipline important.",
            },
        ]
    )
```
These helpers encode two habits that matter in DoubleML work. First, the same sample split is reused across learner comparisons, so differences are less likely to be split noise. Second, nuisance quality is summarized next to the treatment-effect estimate rather than treated as a separate modeling exercise.
Learner Roles In DoubleML
Before we simulate data, it is worth naming what each learner is trying to learn. For PLR, ml_l predicts the outcome from controls and ml_m predicts treatment from controls. They are called nuisance learners because they do not estimate the estimand itself; they are tools used to form an orthogonal score.
This table is deliberately conceptual. A strong DoubleML notebook should make clear what each model is estimating before showing a single fitted coefficient.
```python
nuisance_role_table = pd.DataFrame(
    [
        {
            "DoubleML model": "PLR",
            "learner": "ml_l",
            "target": "E[Y | X]",
            "why it matters": "Removes outcome variation explained by observed controls.",
        },
        {
            "DoubleML model": "PLR",
            "learner": "ml_m",
            "target": "E[D | X]",
            "why it matters": "Removes treatment variation explained by observed controls.",
        },
        {
            "DoubleML model": "IRM",
            "learner": "ml_g",
            "target": "E[Y | D, X]",
            "why it matters": "Builds potential-outcome regressions for treated and control states.",
        },
        {
            "DoubleML model": "IRM/IIVM",
            "learner": "ml_m",
            "target": "P[D = 1 | X] or P[Z = 1 | X]",
            "why it matters": "Controls weighting and overlap behavior through propensity-style models.",
        },
    ]
)
save_table(nuisance_role_table, f"{NOTEBOOK_PREFIX}_nuisance_role_table.csv")
display(nuisance_role_table)
```
| | DoubleML model | learner | target | why it matters |
|---|---|---|---|---|
| 0 | PLR | ml_l | E[Y \| X] | Removes outcome variation explained by observed controls. |
| 1 | PLR | ml_m | E[D \| X] | Removes treatment variation explained by observed controls. |
| 2 | IRM | ml_g | E[Y \| D, X] | Builds potential-outcome regressions for treated and control states. |
| 3 | IRM/IIVM | ml_m | P[D = 1 \| X] or P[Z = 1 \| X] | Controls weighting and overlap behavior through propensity-style models. |
The key lesson is that learners are chosen for the nuisance task, not because they sound advanced. If the treatment model is hard and the outcome model is easy, it can be reasonable to use different learners for ml_l and ml_m.
Synthetic PLR Design
We now create a synthetic PLR dataset with nonlinear confounding. The treatment is continuous, the outcome is continuous, and the true treatment effect is known: TRUE_THETA = 1.30.
The data intentionally contains raw categorical fields such as segment, channel, and tenure_band. This lets us teach a practical preprocessing point: DoubleML’s backend expects the control matrix to be numeric, so categorical variables should be encoded before creating DoubleMLData, or handled in a way that still presents numeric controls to DoubleML.
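The generating cell ran upstream of this section. As a stand-in, the sketch below shows one generator consistent with the data dictionary and the audit that follows (1,400 rows, 20 columns, nonlinear confounding, categorical fields derived from x00 and x01). The coefficients, noise scales, and binning rules are illustrative assumptions, not the notebook's exact design.

```python
# Hypothetical generator sketch for a nonlinear-confounding PLR design.
import numpy as np
import pandas as pd


def make_plr_tuning_data(n=1400, theta=1.30, seed=42):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 12))
    df = pd.DataFrame(X, columns=[f"x{i:02d}" for i in range(12)])
    # Nonlinear treatment nuisance m0(X) and outcome nuisance g0(X).
    true_m = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]
    true_g = np.cos(X[:, 3]) + X[:, 4] ** 2 / 2
    d = true_m + rng.normal(scale=1.0, size=n)
    y = theta * d + true_g + rng.normal(scale=1.0, size=n)
    # Raw categorical controls derived from numeric drivers plus noise.
    df["segment"] = pd.cut(
        X[:, 0] + rng.normal(scale=0.3, size=n), bins=3, labels=["a", "b", "c"]
    ).astype(str)
    df["channel"] = rng.choice(["web", "store", "partner"], size=n)
    df["tenure_band"] = pd.cut(
        X[:, 1] + rng.normal(scale=0.3, size=n), bins=3, labels=["short", "mid", "long"]
    ).astype(str)
    df["unit_id"] = np.arange(n)
    df["true_m"] = true_m
    df["true_g"] = true_g
    df["treatment"] = d
    df["outcome"] = y
    return df


raw_df = make_plr_tuning_data(n=1400, theta=TRUE_THETA, seed=RANDOM_SEED)
save_dataset(raw_df, f"{NOTEBOOK_PREFIX}_raw_synthetic_plr_tuning_data.csv")
display(raw_df.head())
```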
The first rows show the teaching data before encoding. The columns true_m and true_g are included only because this is a simulation; they let us build an oracle benchmark later. They must not be used as controls in the DoubleML backend.
Data Dictionary And Audit
A learner-tuning notebook should still start with data roles. Hyperparameter tuning does not rescue a confused design. This cell records which columns are identifiers, raw controls, oracle-only simulation columns, treatment, and outcome.
```python
field_dictionary = pd.DataFrame(
    [
        {"column": "unit_id", "role": "identifier", "description": "Synthetic row identifier; excluded from modeling."},
        {"column": "x00-x11", "role": "observed controls", "description": "Numeric pre-treatment controls used to create confounding."},
        {"column": "segment", "role": "observed control", "description": "Categorical control derived from x00 with noise."},
        {"column": "channel", "role": "observed control", "description": "Categorical acquisition/source-style control."},
        {"column": "tenure_band", "role": "observed control", "description": "Categorical tenure-style control derived from x01 with noise."},
        {"column": "true_m", "role": "oracle only", "description": "True treatment nuisance E[D|X] used only for simulation diagnostics."},
        {"column": "true_g", "role": "oracle only", "description": "True outcome nuisance component g0(X) used only for simulation diagnostics."},
        {"column": "treatment", "role": "treatment", "description": "Continuous treatment D."},
        {"column": "outcome", "role": "outcome", "description": "Continuous outcome Y."},
    ]
)
raw_audit = pd.DataFrame(
    {
        "n_rows": [len(raw_df)],
        "n_columns": [raw_df.shape[1]],
        "missing_cells": [int(raw_df.isna().sum().sum())],
        "true_theta": [TRUE_THETA],
        "treatment_mean": [raw_df["treatment"].mean()],
        "outcome_mean": [raw_df["outcome"].mean()],
        "corr_treatment_true_g": [raw_df["treatment"].corr(raw_df["true_g"])],
        "corr_treatment_true_m": [raw_df["treatment"].corr(raw_df["true_m"])],
    }
)
save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
save_table(raw_audit, f"{NOTEBOOK_PREFIX}_raw_data_audit.csv")
display(field_dictionary)
display(raw_audit)
```
| | column | role | description |
|---|---|---|---|
| 0 | unit_id | identifier | Synthetic row identifier; excluded from modeling. |
| 1 | x00-x11 | observed controls | Numeric pre-treatment controls used to create confounding. |
| 2 | segment | observed control | Categorical control derived from x00 with noise. |
| 3 | channel | observed control | Categorical acquisition/source-style control. |
| 4 | tenure_band | observed control | Categorical tenure-style control derived from x01 with noise. |
| 5 | true_m | oracle only | True treatment nuisance E[D\|X] used only for simulation diagnostics. |
| 6 | true_g | oracle only | True outcome nuisance component g0(X) used only for simulation diagnostics. |
| 7 | treatment | treatment | Continuous treatment D. |
| 8 | outcome | outcome | Continuous outcome Y. |
| | n_rows | n_columns | missing_cells | true_theta | treatment_mean | outcome_mean | corr_treatment_true_g | corr_treatment_true_m |
|---|---|---|---|---|---|---|---|---|
| 0 | 1400 | 20 | 0 | 1.3 | 0.561494 | 1.611073 | 0.019642 | 0.77966 |
The audit confirms the intended confounding structure: treatment tracks its true nuisance closely (correlation near 0.78 with true_m), and both treatment and outcome depend on shared controls, even though the raw linear correlation between treatment and true_g is small. That is exactly the setting where residualizing with good nuisance learners matters.
Encoding Categorical Controls For DoubleML
The raw dataframe contains strings. DoubleMLData validates the control matrix as numeric, so this cell one-hot encodes the categorical fields before creating the backend.
This is a common production lesson. If a learner needs preprocessing, put that preprocessing either upstream in a reproducible feature table or inside an sklearn pipeline that still receives numeric arrays from DoubleML. Do not rely on ad hoc notebook transformations that are impossible to replay.
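A minimal encoding sketch, assuming raw_df from the data cell and the helpers defined above; the column list and dummy-coding choices mirror the data dictionary.

```python
# One-hot encode the raw categoricals so DoubleMLData receives a fully
# numeric control matrix. dtype=float avoids boolean dummy columns, which
# some backend validations treat as non-numeric.
categorical_cols = ["segment", "channel", "tenure_band"]
encoded_df = pd.get_dummies(raw_df, columns=categorical_cols, drop_first=True, dtype=float)
save_dataset(encoded_df, f"{NOTEBOOK_PREFIX}_encoded_synthetic_plr_tuning_data.csv")

# Oracle columns are excluded automatically by model_x_cols().
dml_data = make_plr_data(encoded_df)
```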
The backend now has a numeric control matrix and excludes the oracle simulation columns. The same feature table can be used by linear models, tree models, LightGBM, and XGBoost.
Baseline And Oracle Benchmarks
Before comparing learners, we anchor the problem with simple benchmarks:
A naive regression of outcome on treatment alone ignores the controls entirely.
A full linear regression adjusts linearly for encoded controls but cannot represent all nonlinear nuisance structure.
An oracle residual regression uses the true simulated nuisance functions and gives a best-case reference that is not available in real data.
The oracle row is included for teaching only. Real applications do not know true_m or true_g.
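A sketch of the three benchmarks, assuming encoded_df from the encoding step; extracting the treatment coefficient from coef_ is illustrative.

```python
# Benchmark sketch: naive, full linear, and oracle residual regression.
from sklearn.linear_model import LinearRegression

X_controls = encoded_df[model_x_cols(encoded_df)].to_numpy()
d = encoded_df["treatment"].to_numpy()
y = encoded_df["outcome"].to_numpy()

# Naive: outcome on treatment alone, no adjustment.
theta_naive = float(LinearRegression().fit(d.reshape(-1, 1), y).coef_[0])

# Full linear: treatment plus all encoded controls, linear adjustment only.
theta_linear = float(LinearRegression().fit(np.column_stack([d, X_controls]), y).coef_[0])

# Oracle: residualize with the true simulated nuisances (teaching only).
d_res = d - encoded_df["true_m"].to_numpy()
y_res = y - encoded_df["true_g"].to_numpy()
theta_oracle = float(np.sum(d_res * y_res) / np.sum(d_res * d_res))

display(pd.DataFrame([{"naive": theta_naive, "full_linear": theta_linear, "oracle": theta_oracle}]))
```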
The naive estimate is not a credible causal estimate because treatment is confounded by observed controls. The oracle row shows the target we hope DoubleML approaches when the nuisance functions are learned well enough.
Safe Tuning Workflow Diagram
This diagram summarizes the mental model for learner choice. The causal design comes first. Learners are selected to estimate nuisance functions. Cross-fitting keeps nuisance predictions out-of-sample for the score. Diagnostics and sensitivity checks then decide whether the result is stable enough to report.
```python
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

workflow_nodes = {
    "design": {"xy": (0.08, 0.58), "label": "Causal\ndesign", "color": "#dbeafe"},
    "backend": {"xy": (0.28, 0.58), "label": "Numeric\nbackend", "color": "#e0f2fe"},
    "learners": {"xy": (0.48, 0.58), "label": "Nuisance\nlearners", "color": "#dcfce7"},
    "tuning": {"xy": (0.68, 0.58), "label": "Tuning by\nCV loss", "color": "#fef3c7"},
    "score": {"xy": (0.88, 0.58), "label": "Cross-fit\northogonal score", "color": "#fee2e2"},
    "unsafe": {"xy": (0.68, 0.28), "label": "Do not tune\nto desired theta", "color": "#f3f4f6"},
}

fig, ax = plt.subplots(figsize=(13, 5.2))
ax.set_axis_off()
ax.set_xlim(0, 1)
ax.set_ylim(0.05, 0.90)
box_w, box_h = 0.13, 0.14


def box_anchor(name, side):
    x, y = workflow_nodes[name]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
    }
    dx, dy = offsets[side]
    return (x + dx, y + dy)


def workflow_arrow(start, end, color="#334155", style="solid", rad=0.0):
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=1.7,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        shrinkA=10,
        shrinkB=10,
        zorder=5,
    )
    ax.add_patch(arrow)


for spec in workflow_nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=10.5, fontweight="bold", zorder=4)

# Draw arrows after the boxes and leave a small gap at each endpoint so arrowheads stay visible.
for left, right in [("design", "backend"), ("backend", "learners"), ("learners", "tuning"), ("tuning", "score")]:
    workflow_arrow(box_anchor(left, "right"), box_anchor(right, "left"))
workflow_arrow(box_anchor("unsafe", "top"), box_anchor("tuning", "bottom"), color="#991b1b", style="dashed", rad=0.0)

ax.text(
    0.50,
    0.10,
    "Choose and tune learners using prediction diagnostics and pre-specified rules, not by chasing a preferred causal estimate.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Learner Choice And Tuning Workflow", pad=14)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_safe_tuning_workflow.png", bbox_inches="tight")
plt.show()
```
The diagram separates two forms of feedback. Prediction diagnostics can inform learner selection. The final causal estimate should not be repeatedly used as a steering wheel until the answer looks attractive.
Learner Families To Compare
We define a compact learner registry. The point is not to test every model available in Python. The point is to cover the main practical families:
regularized linear models as stable baselines,
bagged trees,
sklearn-native boosted trees,
LightGBM,
XGBoost.
All learners are configured conservatively so the notebook runs quickly and avoids extremely expressive settings that overfit a small teaching dataset.
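A registry sketch under those constraints; the specific hyperparameter values are conservative illustrations, and the LightGBM/XGBoost rows assume those packages are installed.

```python
# Compact learner registry covering the families listed above.
from lightgbm import LGBMRegressor
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

learner_registry = {
    # Scaling lives inside the pipeline, so each cross-fitting fold scales
    # using only that fold's training rows.
    "Ridge pipeline": make_pipeline(StandardScaler(), RidgeCV()),
    "Lasso pipeline": make_pipeline(StandardScaler(), LassoCV(random_state=RANDOM_SEED)),
    "Random forest": RandomForestRegressor(
        n_estimators=300, min_samples_leaf=5, random_state=RANDOM_SEED
    ),
    "Hist gradient boosting": HistGradientBoostingRegressor(
        max_depth=3, learning_rate=0.08, random_state=RANDOM_SEED
    ),
    "LightGBM": LGBMRegressor(
        n_estimators=300, num_leaves=15, learning_rate=0.05,
        random_state=RANDOM_SEED, verbose=-1,
    ),
    "XGBoost": XGBRegressor(
        n_estimators=300, max_depth=3, learning_rate=0.05,
        random_state=RANDOM_SEED, verbosity=0,
    ),
}
```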
The pipeline rows show an important sklearn pattern: preprocessing can be part of the learner object. For Ridge and Lasso, scaling is inside the pipeline, so every cross-fitting fold scales using only the training portion of that fold.
Fit Learner Comparison Under Common Splits
This cell fits one PLR model per learner family. Each model uses the same K-fold sample splitting, which makes the comparison easier to read. We record the treatment effect estimate, standard error, confidence interval, cross-fitted nuisance RMSE, and runtime.
A practical warning: this is still a teaching comparison. In real work, learner selection should be planned before looking too hard at the causal estimate.
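The loop itself is short because the bookkeeping lives in fit_plr_with_learner(); this sketch assumes encoded_df and learner_registry from the cells above.

```python
# Fit one PLR per learner family under the same K-fold splits.
comparison_rows = []
fitted_models = {}
for name, learner in learner_registry.items():
    plr, row = fit_plr_with_learner(encoded_df, name, ml_l=learner)
    fitted_models[name] = plr
    comparison_rows.append(row)

learner_comparison = pd.DataFrame(comparison_rows).sort_values("avg_nuisance_rmse")
save_table(learner_comparison, f"{NOTEBOOK_PREFIX}_learner_comparison.csv")
display(learner_comparison)
```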
The table makes the trade-off visible. Flexible learners often reduce nuisance RMSE, but the causal estimate does not move one-for-one with prediction error. Orthogonalization makes the estimate less sensitive to small nuisance errors, but it does not make learner choice irrelevant.
Estimate Comparison Plot
The next figure puts the confidence intervals on the same axis. A useful plot for DoubleML learner comparisons should show uncertainty, not only point estimates. Otherwise, small differences between learners can look more meaningful than they are.
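A sketch of such a figure, assuming learner_comparison from the loop above; the styling is illustrative, and the output path matches the artifact manifest at the end of the notebook.

```python
# Horizontal interval plot: one row per learner, true effect as reference line.
fig, ax = plt.subplots(figsize=(9, 4.5))
y_pos = np.arange(len(learner_comparison))
ax.errorbar(
    learner_comparison["theta_hat"],
    y_pos,
    xerr=[
        learner_comparison["theta_hat"] - learner_comparison["ci_95_lower"],
        learner_comparison["ci_95_upper"] - learner_comparison["theta_hat"],
    ],
    fmt="o",
    capsize=4,
    color="#1d4ed8",
)
ax.axvline(TRUE_THETA, color="#991b1b", linestyle="--", label=f"true theta = {TRUE_THETA}")
ax.set_yticks(y_pos)
ax.set_yticklabels(learner_comparison["learner"])
ax.set_xlabel("theta_hat with 95% CI")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_estimate_comparison.png", bbox_inches="tight")
plt.show()
```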
Most learner choices recover the true effect reasonably well on this design, while the naive estimate sits elsewhere. The plot is a reminder that the main value of DoubleML is not a particular learner; it is the combination of a valid score, cross-fitting, and nuisance adjustment.
Nuisance Quality Versus Estimate Error
This scatterplot connects predictive quality to causal accuracy in the simulation. The x-axis is average nuisance RMSE, the y-axis is absolute bias against the known true effect, and point size reflects runtime.
In real data, the y-axis is unavailable because the true effect is unknown. That is why this plot is pedagogical, not a model-selection recipe.
The relationship is informative but imperfect. Better nuisance models usually help, yet finite-sample variation and regularization choices can still move the final estimate. This is why learner diagnostics should be combined with sensitivity checks and repeated sample-splitting checks in serious work.
Inspect Cross-Fitted Nuisance Predictions
A single RMSE number can hide patterns. Here we compare observed values with cross-fitted nuisance predictions for the LightGBM model. The predictions are out-of-fold predictions created by DoubleML, which makes them the right object to inspect.
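A sketch of pulling those stored predictions out, assuming the LightGBM model was kept in fitted_models above and that predictions were stored during fit (DoubleML's default). DoubleML exposes them as arrays indexed by observation, repetition, and treatment.

```python
# Extract cross-fitted (out-of-fold) nuisance predictions for inspection.
plr_lgbm = fitted_models["LightGBM"]
l_hat = plr_lgbm.predictions["ml_l"][:, 0, 0]  # cross-fitted E[Y | X]
m_hat = plr_lgbm.predictions["ml_m"][:, 0, 0]  # cross-fitted E[D | X]

nuisance_fit = pd.DataFrame(
    {
        "observed_outcome": encoded_df["outcome"],
        "pred_outcome_nuisance": l_hat,
        "observed_treatment": encoded_df["treatment"],
        "pred_treatment_nuisance": m_hat,
    }
)
display(nuisance_fit.describe().round(3))
```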
The outcome nuisance is harder than the treatment nuisance in this synthetic design because the outcome contains treatment-driven variation and a nonlinear baseline component. Seeing this split helps decide where tuning effort should go.
Nuisance Prediction Scatterplots
The scatterplots below show where the selected learner fits well and where it struggles. For a real analysis, this type of plot is not proof of causal validity, but it can reveal obvious underfitting or data-quality issues.
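A plotting sketch using the l_hat and m_hat arrays extracted above; panel titles and colors are illustrative.

```python
# Observed values against cross-fitted nuisance predictions, one panel each.
fig, axes = plt.subplots(1, 2, figsize=(11, 4.5))
axes[0].scatter(l_hat, encoded_df["outcome"], s=10, alpha=0.4, color="#1d4ed8")
axes[0].set_xlabel("cross-fitted E[Y | X]")
axes[0].set_ylabel("observed outcome")
axes[0].set_title("Outcome nuisance (ml_l)")
axes[1].scatter(m_hat, encoded_df["treatment"], s=10, alpha=0.4, color="#047857")
axes[1].set_xlabel("cross-fitted E[D | X]")
axes[1].set_ylabel("observed treatment")
axes[1].set_title("Treatment nuisance (ml_m)")
plt.tight_layout()
plt.show()
```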
The treatment nuisance predictions are tighter because the treatment equation is easier to learn. The outcome nuisance remains noisier, which is normal when the outcome contains both treatment variation and irreducible noise.
Library-Native Hyperparameter Tuning With Optuna
DoubleML provides tune_ml_models(), which uses Optuna to tune nuisance learners. The parameter spaces are functions that receive an Optuna trial and return a parameter dictionary.
This section tunes a LightGBM learner for both ml_l and ml_m. The search is intentionally small: enough to teach the workflow, not enough to spend minutes chasing marginal improvements.
Important distinction: tuning is based on nuisance prediction scores, here negative root mean squared error. We are not asking Optuna to optimize the treatment-effect estimate.
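A tuning sketch following the description above. The method name tune_ml_models() and the trial-function search spaces come from the text; the keyword names used here (ml_param_space, n_trials_per_learner, scoring) are assumptions and should be checked against the installed DoubleML version's documentation.

```python
# Optuna-based nuisance tuning sketch for LightGBM on both ml_l and ml_m.
def lgbm_search_space(trial):
    # Each search space is a function of an Optuna trial returning a dict.
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 400),
        "num_leaves": trial.suggest_int("num_leaves", 7, 31),
        "learning_rate": trial.suggest_float("learning_rate", 0.02, 0.2, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }


plr_tuned = DoubleMLPLR(
    make_plr_data(encoded_df),
    ml_l=LGBMRegressor(random_state=RANDOM_SEED, verbose=-1),
    ml_m=LGBMRegressor(random_state=RANDOM_SEED, verbose=-1),
    n_folds=5,
)

start = time.perf_counter()
tune_result = plr_tuned.tune_ml_models(
    ml_param_space={"ml_l": lgbm_search_space, "ml_m": lgbm_search_space},
    n_trials_per_learner=20,  # deliberately small search budget
    scoring="neg_root_mean_squared_error",
)
print(f"Tuning runtime: {time.perf_counter() - start:.2f} seconds")

start = time.perf_counter()
plr_tuned.fit()
print(f"Final tuned fit runtime: {time.perf_counter() - start:.2f} seconds")
# The notebook also records summary rows (untuned_lgbm_row, tuned_lgbm_row),
# assembled the same way as in fit_plr_with_learner(), for the comparison below.
```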
Tuning runtime: 3.72 seconds
Final tuned fit runtime: 0.68 seconds
The tuning cell separates two costs: the Optuna search time and the final DoubleML fit time. In a real workflow, the search budget should be decided before the analysis begins, especially when many learners and outcomes are being compared.
Tuning Result Details
The returned Optuna objects contain the best parameters and the nuisance score for each learner. We flatten them into a table so the chosen hyperparameters become part of the audit trail.
```python
tune_rows = []
for result_bundle in tune_result:
    for learner_key, result in result_bundle.items():
        tune_rows.append(
            {
                "learner_key": learner_key,
                "best_score_neg_rmse": float(result.best_score),
                "best_rmse": float(-result.best_score),
                "best_params": json.dumps(result.best_params, sort_keys=True),
            }
        )
tune_summary = pd.DataFrame(tune_rows)
tuned_comparison = pd.DataFrame([untuned_lgbm_row, tuned_lgbm_row])
save_table(tune_summary, f"{NOTEBOOK_PREFIX}_optuna_tune_summary.csv")
save_table(tuned_comparison, f"{NOTEBOOK_PREFIX}_tuned_vs_untuned_lgbm.csv")
display(tune_summary)
display(tuned_comparison)
```
| | learner_key | best_score_neg_rmse | best_rmse | best_params |
|---|---|---|---|---|
| 0 | ml_l | -1.851373 | 1.851373 | {"colsample_bytree": 0.9896896099223678, "lear... |
| 1 | ml_m | -1.124470 | 1.124470 | {"colsample_bytree": 0.7094287557060203, "lear... |
| | learner | theta_hat | se | ci_95_lower | ci_95_upper | bias_vs_true | abs_bias_vs_true | rmse_ml_l | rmse_ml_m | mae_ml_l | mae_ml_m | avg_nuisance_rmse | runtime_seconds | tuning_runtime_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM untuned reference | 1.219688 | 0.029935 | 1.161017 | 1.278359 | -0.080312 | 0.080312 | 1.858091 | 1.135388 | 1.479899 | 0.896752 | 1.49674 | 1.05248 | NaN |
| 1 | LightGBM tuned with tune_ml_models | 1.247427 | 0.029360 | 1.189881 | 1.304972 | -0.052573 | 0.052573 | 1.825092 | 1.116527 | 1.444504 | 0.878563 | 1.47081 | 0.67705 | 3.717984 |
The tuned model may or may not dominate the untuned reference on every metric. That is normal. Tuning is a disciplined way to search the learner space, not a guarantee that every finite-sample estimate improves.
Tuned Versus Untuned Plot
This plot compares the LightGBM reference with the tuned LightGBM model. The useful question is not simply which point is closer to the true effect in this simulation. The useful question is whether tuning improves nuisance quality enough to justify the extra complexity and runtime.
The side-by-side view keeps causal and predictive diagnostics together. When tuning changes the causal estimate, the next question should be whether the nuisance predictions, overlap, residuals, or sensitivity checks explain why.
Honest Learner Selection Split
When many learner families are compared, there is a risk of choosing the model because the final estimate looks convenient. One practical discipline is to use an exploration split to choose a learner by nuisance criteria, then run the final estimate on a separate estimation split.
This is not mandatory for every DoubleML analysis, and it costs sample size. But it is a useful teaching device for separating model selection from final effect reporting.
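A sketch of that discipline, assuming encoded_df, learner_registry, and fit_plr_with_learner() from earlier cells; the 50/50 split ratio is an illustrative choice.

```python
# Honest selection: choose the learner on an exploration half by nuisance
# RMSE only, then report the effect from a disjoint estimation half.
from sklearn.model_selection import train_test_split

explore_df, estimate_df = train_test_split(encoded_df, test_size=0.5, random_state=RANDOM_SEED)

explore_rows = [
    fit_plr_with_learner(explore_df.reset_index(drop=True), name, ml_l=learner)[1]
    for name, learner in learner_registry.items()
]
explore_table = pd.DataFrame(explore_rows)

# The selection rule is fixed in advance: best average nuisance RMSE, never theta.
best_name = explore_table.sort_values("avg_nuisance_rmse").iloc[0]["learner"]
final_plr, final_row = fit_plr_with_learner(
    estimate_df.reset_index(drop=True), best_name, ml_l=learner_registry[best_name]
)
display(pd.DataFrame([final_row]))
```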
The exploration split chooses a learner using nuisance RMSE, then the estimation split reports the treatment effect. This is more conservative than selecting the final model after browsing all causal estimates.
Unsafe Tuning Patterns
The next table is not code for fitting a model. It is a checklist of habits to avoid. These mistakes are common because they feel like ordinary machine-learning iteration, but they weaken the credibility of causal reporting.
```python
unsafe_patterns = pd.DataFrame(
    [
        {
            "pattern": "Choose the learner whose theta is closest to a preferred answer",
            "why_it_is_unsafe": "The final estimate becomes part of the selection rule, so uncertainty is understated and the design is no longer pre-specified.",
            "safer_alternative": "Choose learners using nuisance prediction, stability rules, or an honest exploration split.",
        },
        {
            "pattern": "Tune on the full analysis until confidence intervals become significant",
            "why_it_is_unsafe": "Repeated searching changes the meaning of the reported interval.",
            "safer_alternative": "Fix the tuning budget and report sensitivity to reasonable alternatives.",
        },
        {
            "pattern": "Preprocess using outcome information outside a fold",
            "why_it_is_unsafe": "Information from held-out rows can leak into nuisance predictions.",
            "safer_alternative": "Put preprocessing inside sklearn pipelines or build reproducible pre-treatment features only.",
        },
        {
            "pattern": "Report only the best-looking learner",
            "why_it_is_unsafe": "Readers cannot see whether the result is stable across defensible choices.",
            "safer_alternative": "Show a compact learner sensitivity table and explain the chosen primary specification.",
        },
    ]
)
save_table(unsafe_patterns, f"{NOTEBOOK_PREFIX}_unsafe_tuning_patterns.csv")
display(unsafe_patterns)
```
| | pattern | why_it_is_unsafe | safer_alternative |
|---|---|---|---|
| 0 | Choose the learner whose theta is closest to a preferred answer | The final estimate becomes part of the selection rule, so uncertainty is understated and the design is no longer pre-specified. | Choose learners using nuisance prediction, stability rules, or an honest exploration split. |
| 1 | Tune on the full analysis until confidence intervals become significant | Repeated searching changes the meaning of the reported interval. | Fix the tuning budget and report sensitivity to reasonable alternatives. |
| 2 | Preprocess using outcome information outside a fold | Information from held-out rows can leak into nuisance predictions. | Put preprocessing inside sklearn pipelines or build reproducible pre-treatment features only. |
| 3 | Report only the best-looking learner | Readers cannot see whether the result is stable across defensible choices. | Show a compact learner sensitivity table and explain the chosen primary specification. |
The safe alternative is not to avoid tuning. The safe alternative is to make tuning auditable: define the search space, define the score, control the search budget, and show robustness to plausible learner choices.
Reporting Checklist
A good learner-tuning section in a causal notebook should be short but complete. The table below is a reusable reporting checklist.
```python
reporting_checklist = pd.DataFrame(
    [
        {"item": "State the causal estimand before learner selection", "status": "required"},
        {"item": "List the nuisance learners and their targets", "status": "required"},
        {"item": "Describe preprocessing and confirm controls are pre-treatment", "status": "required"},
        {"item": "Report cross-fitting folds and repeated-split choices", "status": "required"},
        {"item": "Show nuisance prediction diagnostics", "status": "recommended"},
        {"item": "Show effect stability across a small set of defensible learners", "status": "recommended"},
        {"item": "Document tuning search spaces, scores, and trial budgets", "status": "required if tuned"},
        {"item": "Avoid choosing models based on desired treatment-effect estimates", "status": "required"},
    ]
)
save_table(reporting_checklist, f"{NOTEBOOK_PREFIX}_learner_tuning_reporting_checklist.csv")
display(reporting_checklist)
```
| | item | status |
|---|---|---|
| 0 | State the causal estimand before learner selection | required |
| 1 | List the nuisance learners and their targets | required |
| 2 | Describe preprocessing and confirm controls are pre-treatment | required |
| 3 | Report cross-fitting folds and repeated-split choices | required |
| 4 | Show nuisance prediction diagnostics | recommended |
| 5 | Show effect stability across a small set of defensible learners | recommended |
| 6 | Document tuning search spaces, scores, and trial budgets | required if tuned |
| 7 | Avoid choosing models based on desired treatment-effect estimates | required |
This checklist is intentionally practical. It turns learner choice from a hidden modeling preference into a reproducible part of the causal analysis.
Report Template
The final cell writes a short report template. The template is not meant to replace analysis; it is meant to keep the final write-up disciplined and consistent.
```python
report_text = f"""# Learner And Tuning Report Template

## Estimand
- Treatment:
- Outcome:
- Control set:
- Target population:
- Identification assumptions:

## Nuisance Learners
- Primary `ml_l` learner:
- Primary `ml_m` learner:
- Reason for primary learner choice:
- Alternative learners checked:

## Preprocessing
- Numeric controls:
- Encoded categorical controls:
- Excluded columns:
- Pipeline steps:

## Cross-Fitting And Tuning
- Number of folds:
- Repeated sample splitting:
- Tuning method:
- Tuning score:
- Search budget:
- Search space summary:

## Diagnostics
- Outcome nuisance RMSE:
- Treatment nuisance RMSE:
- Estimate stability across learners:
- Runtime considerations:

## Final Estimate
- Point estimate:
- Standard error:
- Confidence interval:
- Main caveats:
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_learner_tuning_report_template.md"
report_path.write_text(report_text)

artifact_manifest = pd.DataFrame(
    [
        {"artifact": "raw synthetic data", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_raw_synthetic_plr_tuning_data.csv")},
        {"artifact": "encoded synthetic data", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_encoded_synthetic_plr_tuning_data.csv")},
        {"artifact": "learner comparison", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_learner_comparison.csv")},
        {"artifact": "Optuna tuning summary", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_optuna_tune_summary.csv")},
        {"artifact": "report template", "path": str(report_path)},
        {"artifact": "workflow figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_safe_tuning_workflow.png")},
        {"artifact": "estimate comparison figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_estimate_comparison.png")},
    ]
)
save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(Markdown(f"Report template written to `{report_path}`"))
display(artifact_manifest)
```
Report template written to /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/10_learner_tuning_report_template.md
| | artifact | path |
|---|---|---|
| 0 | raw synthetic data | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 1 | encoded synthetic data | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 2 | learner comparison | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 3 | Optuna tuning summary | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 4 | report template | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 5 | workflow figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 6 | estimate comparison figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
The notebook now has all pieces needed for a learner-tuning tutorial: a known synthetic target, learner-family comparison, nuisance diagnostics, library-native tuning, an honest-selection pattern, and reporting artifacts.
What Comes Next
The natural next topic is sample splitting itself: how cross-fitting works, what repeated cross-fitting changes, how to inspect split stability, and when repeated sample splits are worth the extra runtime.