DoubleML Tutorial 10: Learners, Hyperparameters, And Tuning

This notebook is about the practical part of DoubleML that often decides whether an analysis feels robust or fragile: choosing nuisance learners and tuning them without letting the causal estimate become a model-selection toy.

Double machine learning separates two jobs that are easy to blur: predicting the nuisance functions accurately, and estimating the causal parameter credibly.

In a partially linear regression model, the target is usually written as

    Y = θ·D + g_0(X) + ζ,

with a treatment equation

    D = m_0(X) + V.

DoubleML estimates the nuisance functions g_0(X) and m_0(X), residualizes Y and D, and then estimates θ from an orthogonal score. The important point is that better nuisance prediction can help, but the final goal is not to win a prediction contest. The final goal is a credible estimate of θ under valid identifying assumptions.
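The residual-on-residual logic can be sketched with plain NumPy. This is a minimal illustration with linear nuisance functions fit in-sample, not DoubleML's cross-fitted procedure; the coefficients and sample size are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 5_000, 1.3
X = rng.normal(size=(n, 3))
# Linear nuisance functions for illustration; real analyses use ML learners.
m0 = X @ np.array([0.5, -0.3, 0.2])
g0 = X @ np.array([0.4, 0.6, -0.5])
D = m0 + rng.normal(size=n)
Y = theta * D + g0 + rng.normal(size=n)

def linear_fit_predict(X, y):
    # OLS predictions via least squares, with an intercept column.
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z @ beta

# Residualize both Y and D on X, then regress residual on residual.
v_hat = D - linear_fit_predict(X, D)  # treatment residual
u_hat = Y - linear_fit_predict(X, Y)  # outcome residual
theta_hat = (v_hat @ u_hat) / (v_hat @ v_hat)
print(round(theta_hat, 2))
```

With the nuisance functions truly linear, this recovers θ well; the rest of the notebook is about what happens when they are not.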

That creates a practical tension. Flexible models such as boosted trees can capture nonlinear confounding patterns, but they also introduce many hyperparameters. Tuning can improve nuisance quality, but careless tuning can leak information from the final estimate back into model choice. This notebook teaches a workflow that is useful in real analysis:

  1. Build a numeric DoubleML backend.
  2. Compare sensible learner families under common sample splits.
  3. Diagnose nuisance prediction quality alongside treatment-effect estimates.
  4. Use pipelines where preprocessing is part of the learner.
  5. Tune hyperparameters with DoubleML’s Optuna-based API.
  6. Separate safe model-selection criteria from unsafe causal-estimate shopping.

Expected runtime: usually under one minute on this synthetic dataset. The Optuna section intentionally uses a small number of trials so the notebook stays tutorial-friendly.

Setup

This cell creates the same output folders used across the DoubleML tutorial series, sets a local Matplotlib cache to avoid environment warnings, and imports the packages used in the notebook. The warning filters are intentionally narrow: they suppress known package-level noise while keeping genuine modeling problems visible.

from pathlib import Path
import json
import os
import time
import warnings

PROJECT_ROOT = Path.cwd().resolve()
if PROJECT_ROOT.name == "doubleml":
    PROJECT_ROOT = PROJECT_ROOT.parents[2]

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")
warnings.filterwarnings("ignore", message="IProgress not found.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

import doubleml as dml
from doubleml import DoubleMLData, DoubleMLPLR

from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
import optuna

NOTEBOOK_PREFIX = "10"
RANDOM_SEED = 42
TRUE_THETA = 1.30

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams.update({"figure.dpi": 120, "savefig.dpi": 160})
optuna.logging.set_verbosity(optuna.logging.WARNING)

print(f"DoubleML version: {dml.__version__}")
print(f"Output directory: {OUTPUT_DIR}")
DoubleML version: 0.11.2
Output directory: /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs

The setup confirms that the notebook is running against the local environment and that all artifacts will be written into the DoubleML tutorial output folder. Keeping outputs organized by notebook prefix makes it much easier to review the tutorial later.

Helper Functions

The helper functions below keep the notebook focused on causal reasoning instead of repetitive file and formatting code. There are three especially important helpers:

  • make_plr_data() creates a DoubleML-ready backend for a given dataframe.
  • fit_plr_with_learner() fits a PLR model with common cross-fitting logic and returns both the fitted object and a compact summary row.
  • rmse_metric() evaluates nuisance predictions from DoubleML’s stored cross-fitted predictions.

The metric function is written defensively because some DoubleML model classes can contain missing nuisance targets. PLR does not usually create missing targets, but keeping the helper general makes the pattern reusable.

def save_table(df, filename):
    path = TABLE_DIR / filename
    df.to_csv(path, index=False)
    return path


def save_dataset(df, filename):
    path = DATASET_DIR / filename
    df.to_csv(path, index=False)
    return path


def rmse_metric(y_true, y_pred):
    mask = ~np.isnan(y_true)
    return mean_squared_error(y_true[mask], y_pred[mask]) ** 0.5


def mae_metric(y_true, y_pred):
    mask = ~np.isnan(y_true)
    return mean_absolute_error(y_true[mask], y_pred[mask])


def model_x_cols(df):
    excluded = {"unit_id", "outcome", "treatment", "true_g", "true_m"}
    return [col for col in df.columns if col not in excluded]


def make_plr_data(df):
    return DoubleMLData(df, y_col="outcome", d_cols="treatment", x_cols=model_x_cols(df))


def make_common_splits(df, n_folds=5, seed=RANDOM_SEED):
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return list(splitter.split(df))


def fit_plr_with_learner(df, learner_name, ml_l, ml_m=None, n_folds=5, seed=RANDOM_SEED):
    if ml_m is None:
        ml_m = ml_l

    dml_data = make_plr_data(df)
    sample_splits = make_common_splits(df, n_folds=n_folds, seed=seed)
    plr = DoubleMLPLR(
        dml_data,
        ml_l=clone(ml_l),
        ml_m=clone(ml_m),
        n_folds=n_folds,
        draw_sample_splitting=False,
    )
    plr.set_sample_splitting(sample_splits)

    start = time.perf_counter()
    plr.fit()
    runtime_seconds = time.perf_counter() - start

    learner_rmse = plr.evaluate_learners(metric=rmse_metric)
    learner_mae = plr.evaluate_learners(metric=mae_metric)

    row = {
        "learner": learner_name,
        "theta_hat": float(plr.coef[0]),
        "se": float(plr.se[0]),
        "ci_95_lower": float(plr.confint().iloc[0, 0]),
        "ci_95_upper": float(plr.confint().iloc[0, 1]),
        "bias_vs_true": float(plr.coef[0] - TRUE_THETA),
        "abs_bias_vs_true": float(abs(plr.coef[0] - TRUE_THETA)),
        "rmse_ml_l": float(learner_rmse["ml_l"][0, 0]),
        "rmse_ml_m": float(learner_rmse["ml_m"][0, 0]),
        "mae_ml_l": float(learner_mae["ml_l"][0, 0]),
        "mae_ml_m": float(learner_mae["ml_m"][0, 0]),
        "avg_nuisance_rmse": float((learner_rmse["ml_l"][0, 0] + learner_rmse["ml_m"][0, 0]) / 2),
        "runtime_seconds": runtime_seconds,
    }
    return plr, row


def learner_family_table():
    return pd.DataFrame(
        [
            {
                "family": "Penalized linear models",
                "examples": "RidgeCV, LassoCV",
                "strength": "Fast, stable, strong when nuisance functions are close to linear after feature engineering.",
                "watch_out": "Can underfit nonlinear confounding and make residualization too crude.",
            },
            {
                "family": "Random forests",
                "examples": "RandomForestRegressor",
                "strength": "Good default for nonlinearities and interactions with limited tuning.",
                "watch_out": "Can be slower and less smooth; shallow/deep choices matter.",
            },
            {
                "family": "Histogram gradient boosting",
                "examples": "HistGradientBoostingRegressor",
                "strength": "Strong sklearn-native boosted tree option with good speed.",
                "watch_out": "Needs regularization through depth, leaves, learning rate, and minimum leaf size.",
            },
            {
                "family": "External boosted trees",
                "examples": "LightGBM, XGBoost",
                "strength": "Flexible, fast, and common in applied data science workflows.",
                "watch_out": "Large search spaces make tuning discipline important.",
            },
        ]
    )

These helpers encode two habits that matter in DoubleML work. First, the same sample split is reused across learner comparisons, so differences are less likely to be split noise. Second, nuisance quality is summarized next to the treatment-effect estimate rather than treated as a separate modeling exercise.

Learner Roles In DoubleML

Before we simulate data, it is worth naming what each learner is trying to learn. For PLR, ml_l predicts the outcome from controls and ml_m predicts treatment from controls. They are nuisance learners because they are not the estimand. They are tools used to form an orthogonal score.

This table is deliberately conceptual. A strong DoubleML notebook should make clear what each model is estimating before showing a single fitted coefficient.

nuisance_role_table = pd.DataFrame(
    [
        {
            "DoubleML model": "PLR",
            "learner": "ml_l",
            "target": "E[Y | X]",
            "why it matters": "Removes outcome variation explained by observed controls.",
        },
        {
            "DoubleML model": "PLR",
            "learner": "ml_m",
            "target": "E[D | X]",
            "why it matters": "Removes treatment variation explained by observed controls.",
        },
        {
            "DoubleML model": "IRM",
            "learner": "ml_g",
            "target": "E[Y | D, X]",
            "why it matters": "Builds potential-outcome regressions for treated and control states.",
        },
        {
            "DoubleML model": "IRM/IIVM",
            "learner": "ml_m",
            "target": "P[D = 1 | X] or P[Z = 1 | X]",
            "why it matters": "Controls weighting and overlap behavior through propensity-style models.",
        },
    ]
)

save_table(nuisance_role_table, f"{NOTEBOOK_PREFIX}_nuisance_role_table.csv")
display(nuisance_role_table)
DoubleML model learner target why it matters
0 PLR ml_l E[Y | X] Removes outcome variation explained by observe...
1 PLR ml_m E[D | X] Removes treatment variation explained by obser...
2 IRM ml_g E[Y | D, X] Builds potential-outcome regressions for treat...
3 IRM/IIVM ml_m P[D = 1 | X] or P[Z = 1 | X] Controls weighting and overlap behavior throug...

The key lesson is that learners are chosen for the nuisance task, not because they sound advanced. If the treatment model is hard and the outcome model is easy, it can be reasonable to use different learners for ml_l and ml_m.

Synthetic PLR Design

We now create a synthetic PLR dataset with nonlinear confounding. The treatment is continuous, the outcome is continuous, and the true treatment effect is known: TRUE_THETA = 1.30.

The data intentionally contains raw categorical fields such as segment, channel, and tenure_band. This lets us teach a practical preprocessing point: DoubleML’s backend expects the control matrix to be numeric, so categorical variables should be encoded before creating DoubleMLData, or handled in a way that still presents numeric controls to DoubleML.

rng = np.random.default_rng(RANDOM_SEED)
n_obs = 1_400
n_numeric = 12

x_array = rng.normal(size=(n_obs, n_numeric))
raw_df = pd.DataFrame(x_array, columns=[f"x{i:02d}" for i in range(n_numeric)])
raw_df.insert(0, "unit_id", np.arange(n_obs))

raw_df["segment"] = pd.cut(
    raw_df["x00"] + 0.40 * rng.normal(size=n_obs),
    bins=[-np.inf, -0.70, 0.55, np.inf],
    labels=["low", "mid", "high"],
).astype(str)
raw_df["channel"] = rng.choice(["organic", "paid", "partner"], size=n_obs, p=[0.55, 0.30, 0.15])
raw_df["tenure_band"] = pd.cut(
    raw_df["x01"] + 0.20 * rng.normal(size=n_obs),
    bins=[-np.inf, -0.80, 0.20, 1.00, np.inf],
    labels=["new", "growing", "mature", "loyal"],
).astype(str)

segment_high = (raw_df["segment"] == "high").astype(float)
paid_channel = (raw_df["channel"] == "paid").astype(float)
loyal_tenure = (raw_df["tenure_band"] == "loyal").astype(float)

true_m = (
    0.80 * np.sin(raw_df["x00"])
    + 0.50 * raw_df["x01"] ** 2
    - 0.45 * raw_df["x02"]
    + 0.35 * segment_high
    - 0.20 * paid_channel
    + 0.40 * raw_df["x05"] * raw_df["x06"]
)
true_g = (
    1.00 * np.cos(raw_df["x00"])
    + 0.45 * raw_df["x03"] * raw_df["x04"]
    + 0.60 * np.tanh(raw_df["x07"])
    + 0.35 * loyal_tenure
    - 0.25 * paid_channel
    + 0.25 * raw_df["x08"] ** 2
)

raw_df["true_m"] = true_m
raw_df["true_g"] = true_g
raw_df["treatment"] = true_m + rng.normal(scale=1.00, size=n_obs)
raw_df["outcome"] = TRUE_THETA * raw_df["treatment"] + true_g + rng.normal(scale=1.00, size=n_obs)

save_dataset(raw_df, f"{NOTEBOOK_PREFIX}_raw_synthetic_plr_tuning_data.csv")
raw_df.head()
unit_id x00 x01 x02 x03 x04 x05 x06 x07 x08 x09 x10 x11 segment channel tenure_band true_m true_g treatment outcome
0 0 0.304717 -1.039984 0.750451 0.940565 -1.951035 -1.302180 0.127840 -0.316243 -0.016801 -0.853044 0.879398 0.777792 high paid new 0.526511 -0.305445 0.991977 0.377864
1 1 0.066031 1.127241 0.467509 -0.859292 0.368751 -0.958883 0.878450 -0.049926 -0.184862 -0.680930 1.222541 -0.154529 mid paid loyal -0.059189 0.933844 -0.768898 1.113530
2 2 -0.428328 -0.352134 0.532309 0.365444 0.412733 0.430821 2.141648 -0.406415 -0.512243 -0.813773 0.615979 1.128972 mid organic growing -0.140754 0.811879 -0.605066 0.269105
3 3 -0.113947 -0.840156 -0.824481 0.650593 0.743254 0.543154 -0.665510 0.232161 0.116686 0.218689 0.871429 0.223596 mid organic growing 0.488397 1.351366 1.168090 3.133282
4 4 0.678914 0.067579 0.289119 0.631288 -1.457156 -0.319671 -0.470373 -0.638878 -0.275142 1.494941 -0.865831 0.968278 mid organic mature 0.434684 0.044752 1.811592 3.283717

The first rows show the teaching data before encoding. The columns true_m and true_g are included only because this is a simulation; they let us build an oracle benchmark later. They must not be used as controls in the DoubleML backend.

Data Dictionary And Audit

A learner-tuning notebook should still start with data roles. Hyperparameter tuning does not rescue a confused design. This cell records which columns are identifiers, raw controls, oracle-only simulation columns, treatment, and outcome.

field_dictionary = pd.DataFrame(
    [
        {"column": "unit_id", "role": "identifier", "description": "Synthetic row identifier; excluded from modeling."},
        {"column": "x00-x11", "role": "observed controls", "description": "Numeric pre-treatment controls used to create confounding."},
        {"column": "segment", "role": "observed control", "description": "Categorical control derived from x00 with noise."},
        {"column": "channel", "role": "observed control", "description": "Categorical acquisition/source-style control."},
        {"column": "tenure_band", "role": "observed control", "description": "Categorical tenure-style control derived from x01 with noise."},
        {"column": "true_m", "role": "oracle only", "description": "True treatment nuisance E[D|X] used only for simulation diagnostics."},
        {"column": "true_g", "role": "oracle only", "description": "True outcome nuisance component g0(X) used only for simulation diagnostics."},
        {"column": "treatment", "role": "treatment", "description": "Continuous treatment D."},
        {"column": "outcome", "role": "outcome", "description": "Continuous outcome Y."},
    ]
)

raw_audit = pd.DataFrame(
    {
        "n_rows": [len(raw_df)],
        "n_columns": [raw_df.shape[1]],
        "missing_cells": [int(raw_df.isna().sum().sum())],
        "true_theta": [TRUE_THETA],
        "treatment_mean": [raw_df["treatment"].mean()],
        "outcome_mean": [raw_df["outcome"].mean()],
        "corr_treatment_true_g": [raw_df["treatment"].corr(raw_df["true_g"])],
        "corr_treatment_true_m": [raw_df["treatment"].corr(raw_df["true_m"])],
    }
)

save_table(field_dictionary, f"{NOTEBOOK_PREFIX}_field_dictionary.csv")
save_table(raw_audit, f"{NOTEBOOK_PREFIX}_raw_data_audit.csv")

display(field_dictionary)
display(raw_audit)
column role description
0 unit_id identifier Synthetic row identifier; excluded from modeling.
1 x00-x11 observed controls Numeric pre-treatment controls used to create ...
2 segment observed control Categorical control derived from x00 with noise.
3 channel observed control Categorical acquisition/source-style control.
4 tenure_band observed control Categorical tenure-style control derived from ...
5 true_m oracle only True treatment nuisance E[D|X] used only for s...
6 true_g oracle only True outcome nuisance component g0(X) used onl...
7 treatment treatment Continuous treatment D.
8 outcome outcome Continuous outcome Y.
n_rows n_columns missing_cells true_theta treatment_mean outcome_mean corr_treatment_true_g corr_treatment_true_m
0 1400 20 0 1.3 0.561494 1.611073 0.019642 0.77966

The audit shows strong confounding in the treatment equation: treatment correlates at about 0.78 with its own nuisance true_m, while its linear correlation with true_g is near zero. The confounding in this design is therefore mostly nonlinear, which is exactly the setting where residualizing with flexible nuisance learners matters.

Encoding Categorical Controls For DoubleML

The raw dataframe contains strings. DoubleMLData validates the control matrix as numeric, so this cell one-hot encodes the categorical fields before creating the backend.

This is a common production lesson. If a learner needs preprocessing, put that preprocessing either upstream in a reproducible feature table or inside an sklearn pipeline that still receives numeric arrays from DoubleML. Do not rely on ad hoc notebook transformations that are impossible to replay.

encoded_df = pd.get_dummies(
    raw_df,
    columns=["segment", "channel", "tenure_band"],
    drop_first=False,
    dtype=float,
)

x_cols = model_x_cols(encoded_df)
dml_data = make_plr_data(encoded_df)

encoding_summary = pd.DataFrame(
    {
        "raw_rows": [raw_df.shape[0]],
        "raw_columns": [raw_df.shape[1]],
        "encoded_columns": [encoded_df.shape[1]],
        "model_control_columns": [len(x_cols)],
        "oracle_columns_excluded": ["true_m, true_g"],
    }
)

save_dataset(encoded_df, f"{NOTEBOOK_PREFIX}_encoded_synthetic_plr_tuning_data.csv")
save_table(encoding_summary, f"{NOTEBOOK_PREFIX}_encoding_summary.csv")

display(encoding_summary)
print(dml_data)
raw_rows raw_columns encoded_columns model_control_columns oracle_columns_excluded
0 1400 20 27 22 true_m, true_g
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: outcome
Treatment variable(s): ['treatment']
Covariates: ['x00', 'x01', 'x02', 'x03', 'x04', 'x05', 'x06', 'x07', 'x08', 'x09', 'x10', 'x11', 'segment_high', 'segment_low', 'segment_mid', 'channel_organic', 'channel_paid', 'channel_partner', 'tenure_band_growing', 'tenure_band_loyal', 'tenure_band_mature', 'tenure_band_new']
Instrument variable(s): None
No. Observations: 1400
------------------ DataFrame info    ------------------
<class 'pandas.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Columns: 27 entries, unit_id to tenure_band_new
dtypes: float64(26), int64(1)
memory usage: 295.4 KB

The backend now has a numeric control matrix and excludes the oracle simulation columns. The same feature table can be used by linear models, tree models, LightGBM, and XGBoost.

Baseline And Oracle Benchmarks

Before comparing learners, we anchor the problem with simple benchmarks:

  • A naive regression of outcome on treatment only ignores controls.
  • A full linear regression adjusts linearly for encoded controls but cannot represent all nonlinear nuisance structure.
  • An oracle residual regression uses the true simulated nuisance functions and gives a best-case reference that is not available in real data.

The oracle row is included for teaching only. Real applications do not know true_m or true_g.

def treatment_coefficient(y, d, controls=None):
    d_matrix = np.asarray(d).reshape(-1, 1)
    if controls is None:
        design = d_matrix
    else:
        design = np.column_stack([d_matrix, np.asarray(controls)])
    model = LinearRegression().fit(design, y)
    return float(model.coef_[0])

naive_theta = treatment_coefficient(raw_df["outcome"], raw_df["treatment"])
linear_adjusted_theta = treatment_coefficient(
    encoded_df["outcome"],
    encoded_df["treatment"],
    encoded_df[x_cols],
)
oracle_theta = treatment_coefficient(
    raw_df["outcome"] - raw_df["true_g"],
    raw_df["treatment"] - raw_df["true_m"],
)

baseline_table = pd.DataFrame(
    [
        {"estimator": "Naive outcome ~ treatment", "theta_hat": naive_theta, "uses_controls": False},
        {"estimator": "Full linear adjustment", "theta_hat": linear_adjusted_theta, "uses_controls": True},
        {"estimator": "Oracle residual regression", "theta_hat": oracle_theta, "uses_controls": "oracle"},
    ]
)
baseline_table["true_theta"] = TRUE_THETA
baseline_table["bias_vs_true"] = baseline_table["theta_hat"] - TRUE_THETA

save_table(baseline_table, f"{NOTEBOOK_PREFIX}_baseline_and_oracle_estimates.csv")
display(baseline_table)
estimator theta_hat uses_controls true_theta bias_vs_true
0 Naive outcome ~ treatment 1.319604 False 1.3 0.019604
1 Full linear adjustment 1.314848 True 1.3 0.014848
2 Oracle residual regression 1.350645 oracle 1.3 0.050645

The naive estimate has no protection against confounding; in this draw it happens to land near the truth mainly because the linear correlation between treatment and the outcome nuisance is weak. The oracle row shows the target we hope DoubleML approaches when the nuisance functions are learned well enough.

Safe Tuning Workflow Diagram

This diagram summarizes the mental model for learner choice. The causal design comes first. Learners are selected to estimate nuisance functions. Cross-fitting keeps nuisance predictions out-of-sample for the score. Diagnostics and sensitivity checks then decide whether the result is stable enough to report.

from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

workflow_nodes = {
    "design": {"xy": (0.08, 0.58), "label": "Causal\ndesign", "color": "#dbeafe"},
    "backend": {"xy": (0.28, 0.58), "label": "Numeric\nbackend", "color": "#e0f2fe"},
    "learners": {"xy": (0.48, 0.58), "label": "Nuisance\nlearners", "color": "#dcfce7"},
    "tuning": {"xy": (0.68, 0.58), "label": "Tuning by\nCV loss", "color": "#fef3c7"},
    "score": {"xy": (0.88, 0.58), "label": "Cross-fit\northogonal score", "color": "#fee2e2"},
    "unsafe": {"xy": (0.68, 0.28), "label": "Do not tune\nto desired theta", "color": "#f3f4f6"},
}

fig, ax = plt.subplots(figsize=(13, 5.2))
ax.set_axis_off()
ax.set_xlim(0, 1)
ax.set_ylim(0.05, 0.90)
box_w, box_h = 0.13, 0.14


def box_anchor(name, side):
    x, y = workflow_nodes[name]["xy"]
    offsets = {"left": (-box_w / 2, 0), "right": (box_w / 2, 0), "top": (0, box_h / 2), "bottom": (0, -box_h / 2)}
    dx, dy = offsets[side]
    return (x + dx, y + dy)


def workflow_arrow(start, end, color="#334155", style="solid", rad=0.0):
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=1.7,
        color=color,
        linestyle=style,
        connectionstyle=f"arc3,rad={rad}",
        shrinkA=10,
        shrinkB=10,
        zorder=5,
    )
    ax.add_patch(arrow)

for spec in workflow_nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=10.5, fontweight="bold", zorder=4)

# Draw arrows after the boxes and leave a small gap at each endpoint so arrowheads stay visible.
for left, right in [("design", "backend"), ("backend", "learners"), ("learners", "tuning"), ("tuning", "score")]:
    workflow_arrow(box_anchor(left, "right"), box_anchor(right, "left"))

workflow_arrow(box_anchor("unsafe", "top"), box_anchor("tuning", "bottom"), color="#991b1b", style="dashed", rad=0.0)
ax.text(
    0.50,
    0.10,
    "Choose and tune learners using prediction diagnostics and pre-specified rules, not by chasing a preferred causal estimate.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Learner Choice And Tuning Workflow", pad=14)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_safe_tuning_workflow.png", bbox_inches="tight")
plt.show()

The diagram separates two forms of feedback. Prediction diagnostics can inform learner selection. The final causal estimate should not be repeatedly used as a steering wheel until the answer looks attractive.

Learner Families To Compare

We define a compact learner registry. The point is not to test every model available in Python. The point is to cover the main practical families:

  • regularized linear models as stable baselines,
  • bagged trees,
  • sklearn-native boosted trees,
  • LightGBM,
  • XGBoost.

All learners are configured conservatively so the notebook runs quickly and avoids extremely expressive settings that overfit a small teaching dataset.

learner_registry = {
    "Ridge pipeline": Pipeline(
        [
            ("scale", StandardScaler()),
            ("model", RidgeCV(alphas=np.logspace(-3, 3, 15))),
        ]
    ),
    "LassoCV pipeline": Pipeline(
        [
            ("scale", StandardScaler()),
            ("model", LassoCV(cv=4, random_state=RANDOM_SEED, max_iter=5_000)),
        ]
    ),
    "Random forest": RandomForestRegressor(
        n_estimators=100,
        max_depth=7,
        min_samples_leaf=8,
        max_features=0.70,
        n_jobs=1,
        random_state=RANDOM_SEED,
    ),
    "HistGradientBoosting": HistGradientBoostingRegressor(
        max_iter=100,
        max_leaf_nodes=17,
        min_samples_leaf=20,
        learning_rate=0.06,
        random_state=RANDOM_SEED,
    ),
    "LightGBM": LGBMRegressor(
        n_estimators=120,
        learning_rate=0.04,
        num_leaves=24,
        min_child_samples=25,
        subsample=0.85,
        colsample_bytree=0.85,
        random_state=RANDOM_SEED,
        verbose=-1,
        n_jobs=1,
    ),
    "XGBoost": XGBRegressor(
        n_estimators=120,
        max_depth=3,
        learning_rate=0.04,
        subsample=0.85,
        colsample_bytree=0.85,
        reg_lambda=2.0,
        objective="reg:squarederror",
        random_state=RANDOM_SEED,
        n_jobs=1,
        verbosity=0,
    ),
}

family_table = learner_family_table()
pipeline_table = pd.DataFrame(
    [
        {"learner": name, "object_type": type(learner).__name__, "repr": repr(learner)[:180]}
        for name, learner in learner_registry.items()
    ]
)

save_table(family_table, f"{NOTEBOOK_PREFIX}_learner_family_table.csv")
save_table(pipeline_table, f"{NOTEBOOK_PREFIX}_learner_registry.csv")

display(family_table)
display(pipeline_table)
family examples strength watch_out
0 Penalized linear models RidgeCV, LassoCV Fast, stable, strong when nuisance functions a... Can underfit nonlinear confounding and make re...
1 Random forests RandomForestRegressor Good default for nonlinearities and interactio... Can be slower and less smooth; shallow/deep ch...
2 Histogram gradient boosting HistGradientBoostingRegressor Strong sklearn-native boosted tree option with... Needs regularization through depth, leaves, le...
3 External boosted trees LightGBM, XGBoost Flexible, fast, and common in applied data sci... Large search spaces make tuning discipline imp...
learner object_type repr
0 Ridge pipeline Pipeline Pipeline(steps=[('scale', StandardScaler()),\n...
1 LassoCV pipeline Pipeline Pipeline(steps=[('scale', StandardScaler()),\n...
2 Random forest RandomForestRegressor RandomForestRegressor(max_depth=7, max_feature...
3 HistGradientBoosting HistGradientBoostingRegressor HistGradientBoostingRegressor(learning_rate=0....
4 LightGBM LGBMRegressor LGBMRegressor(colsample_bytree=0.85, learning_...
5 XGBoost XGBRegressor XGBRegressor(base_score=None, booster=None, ca...

The pipeline rows show an important sklearn pattern: preprocessing can be part of the learner object. For Ridge and Lasso, scaling is inside the pipeline, so every cross-fitting fold scales using only the training portion of that fold.
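A minimal sklearn sketch of the same pattern, on illustrative data: because the scaler lives inside the pipeline, every cross-validation fold refits it on that fold's training portion only, so no test-fold statistics leak into preprocessing.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(size=300)

# Scaling is a pipeline step, so each CV fold scales with training data only.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(scores.mean())
```

DoubleML's cross-fitting clones and refits the whole pipeline per fold in the same way, which is why pipelines are the safe place for learner-specific preprocessing.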

Fit Learner Comparison Under Common Splits

This cell fits one PLR model per learner family. Each model uses the same K-fold sample splitting, which makes the comparison easier to read. We record the treatment effect estimate, standard error, confidence interval, cross-fitted nuisance RMSE, and runtime.

A practical warning: this is still a teaching comparison. In real work, learner selection should be planned before looking too hard at the causal estimate.

comparison_rows = []
fitted_models = {}

for learner_name, learner in learner_registry.items():
    fitted_model, row = fit_plr_with_learner(
        encoded_df,
        learner_name=learner_name,
        ml_l=learner,
        ml_m=learner,
        n_folds=5,
        seed=RANDOM_SEED,
    )
    fitted_models[learner_name] = fitted_model
    comparison_rows.append(row)

learner_comparison = pd.DataFrame(comparison_rows).sort_values("avg_nuisance_rmse").reset_index(drop=True)
save_table(learner_comparison, f"{NOTEBOOK_PREFIX}_learner_comparison.csv")
display(learner_comparison)
learner theta_hat se ci_95_lower ci_95_upper bias_vs_true abs_bias_vs_true rmse_ml_l rmse_ml_m mae_ml_l mae_ml_m avg_nuisance_rmse runtime_seconds
0 HistGradientBoosting 1.260231 0.029922 1.201585 1.318877 -0.039769 0.039769 1.840439 1.114303 1.459938 0.879263 1.477371 2.927575
1 LightGBM 1.272523 0.029346 1.215006 1.330040 -0.027477 0.027477 1.849670 1.118639 1.465260 0.884099 1.484155 0.759393
2 XGBoost 1.296269 0.029296 1.238851 1.353688 -0.003731 0.003731 1.862596 1.115945 1.467989 0.881809 1.489271 0.729560
3 Random forest 1.280727 0.029370 1.223163 1.338292 -0.019273 0.019273 1.909728 1.146352 1.507496 0.903018 1.528040 4.171908
4 Ridge pipeline 1.313829 0.026528 1.261836 1.365823 0.013829 0.013829 2.022608 1.237154 1.597290 0.960121 1.629881 0.055897
5 LassoCV pipeline 1.312696 0.026267 1.261213 1.364180 0.012696 0.012696 2.023626 1.240213 1.598651 0.961691 1.631920 0.223674

The table makes the trade-off visible. Flexible learners often reduce nuisance RMSE, but the causal estimate does not move one-for-one with prediction error. Orthogonalization makes the estimate less sensitive to small nuisance errors, but it does not make learner choice irrelevant.
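That second-order behavior can be sketched directly. The toy construction below deliberately perturbs both nuisance "estimates" around the truth and shows that the bias of the residual-on-residual estimator grows roughly with the product of the two errors. This is a hand-built illustration of Neyman orthogonality, not the DoubleML implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 20_000, 1.3
x = rng.normal(size=n)
m0, g0 = np.sin(x), np.cos(x)
d = m0 + rng.normal(size=n)
y = theta * d + g0 + rng.normal(size=n)

def perturbed_theta(eps_m, eps_l):
    # Deliberately wrong nuisance "estimates": the truth plus a linear error.
    m_hat = m0 + eps_m * x
    l_hat = theta * m0 + g0 + eps_l * x  # perturbed E[Y | X]
    v, u = d - m_hat, y - l_hat
    return (v @ u) / (v @ v)

bias_small = abs(perturbed_theta(0.1, 0.1) - theta)
bias_large = abs(perturbed_theta(0.4, 0.4) - theta)
print(bias_small, bias_large)
```

Quadrupling both nuisance errors inflates the bias far more than fourfold, while small errors in both nuisances leave the estimate nearly untouched; that is the orthogonality property doing its job.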

Estimate Comparison Plot

The next figure puts the confidence intervals on the same axis. A useful plot for DoubleML learner comparisons should show uncertainty, not only point estimates. Otherwise, small differences between learners can look more meaningful than they are.

estimate_plot_df = learner_comparison.sort_values("theta_hat").copy()

fig, ax = plt.subplots(figsize=(11, 6))
ax.errorbar(
    x=estimate_plot_df["theta_hat"],
    y=estimate_plot_df["learner"],
    xerr=1.96 * estimate_plot_df["se"],
    fmt="o",
    color="#111827",
    ecolor="#475569",
    capsize=4,
    linewidth=1.5,
)
ax.axvline(TRUE_THETA, color="#dc2626", linestyle="--", linewidth=1.6, label="True effect")
ax.axvline(naive_theta, color="#9ca3af", linestyle=":", linewidth=1.5, label="Naive estimate")
ax.set_title("DoubleML PLR Estimates Across Learner Families")
ax.set_xlabel("Estimated treatment effect")
ax.set_ylabel("")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_estimate_comparison.png", bbox_inches="tight")
plt.show()

Most learner choices recover the true effect reasonably well on this design, while the naive estimate sits well away from it. The plot is a reminder that the main value of DoubleML is not a particular learner; it is the combination of a valid score, cross-fitting, and nuisance adjustment.

Nuisance Quality Versus Estimate Error

This scatterplot connects predictive quality to causal accuracy in the simulation. The x-axis is average nuisance RMSE, the y-axis is absolute bias against the known true effect, and point size reflects runtime.

In real data, the y-axis is unavailable because the true effect is unknown. That is why this plot is pedagogical, not a model-selection recipe.

fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(

    data=learner_comparison,
    x="avg_nuisance_rmse",
    y="abs_bias_vs_true",
    size="runtime_seconds",
    sizes=(80, 500),
    hue="learner",
    palette="tab10",
    ax=ax,
)
for _, row in learner_comparison.iterrows():
    ax.annotate(
        row["learner"],
        (row["avg_nuisance_rmse"], row["abs_bias_vs_true"]),
        textcoords="offset points",
        xytext=(6, 5),
        fontsize=9,
    )
ax.set_title("Nuisance Prediction Quality And Simulated Effect Bias")
ax.set_xlabel("Average nuisance RMSE")
ax.set_ylabel("Absolute bias vs known true effect")
ax.legend_.remove()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_nuisance_rmse_vs_bias.png", bbox_inches="tight")
plt.show()

The relationship is informative but imperfect. Better nuisance models usually help, yet finite-sample variation and regularization choices can still move the final estimate. This is why learner diagnostics should be combined with sensitivity checks and repeated sample-splitting checks in serious work.
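A repeated sample-splitting check can be sketched without the full DoubleML machinery. The example below is a simplified stand-in, not the library's API: it uses a cross-fitted residual-on-residual estimator with sklearn's `KFold` and re-draws the split under several seeds, so the spread across seeds shows how much a single estimate depends on one particular split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, theta = 1000, 1.3
x = rng.normal(size=(n, 3))
d = np.sin(x[:, 0]) + rng.normal(size=n)
y = theta * d + x[:, 1] ** 2 + rng.normal(size=n)

def crossfit_theta(split_seed):
    """Cross-fitted residual-on-residual estimate for one split seed."""
    m_hat, ell_hat = np.zeros(n), np.zeros(n)
    for train, test in KFold(5, shuffle=True, random_state=split_seed).split(x):
        m_hat[test] = RandomForestRegressor(
            n_estimators=50, random_state=0
        ).fit(x[train], d[train]).predict(x[test])
        ell_hat[test] = RandomForestRegressor(
            n_estimators=50, random_state=0
        ).fit(x[train], y[train]).predict(x[test])
    v, u = d - m_hat, y - ell_hat
    return float(v @ u / (v @ v))

estimates = [crossfit_theta(s) for s in range(5)]
print([round(t, 3) for t in estimates],
      "spread:", round(float(np.ptp(estimates)), 3))
```

In DoubleML itself, the analogous check is the `n_rep` argument, which repeats cross-fitting over fresh splits and aggregates the results.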

Inspect Cross-Fitted Nuisance Predictions

A single RMSE number can hide patterns. Here we compare observed values with cross-fitted nuisance predictions for the LightGBM model. The predictions are out-of-fold predictions created by DoubleML, which makes them the right object to inspect.

selected_name = "LightGBM"
selected_model = fitted_models[selected_name]

prediction_frame = pd.DataFrame(
    {
        "observed_outcome": encoded_df["outcome"].to_numpy(),
        "predicted_outcome_nuisance": selected_model.predictions["ml_l"].squeeze(),
        "observed_treatment": encoded_df["treatment"].to_numpy(),
        "predicted_treatment_nuisance": selected_model.predictions["ml_m"].squeeze(),
    }
)

prediction_quality = pd.DataFrame(
    [
        {
            "learner": selected_name,
            "nuisance": "ml_l outcome model",
            "rmse": rmse_metric(prediction_frame["observed_outcome"], prediction_frame["predicted_outcome_nuisance"]),
            "mae": mean_absolute_error(prediction_frame["observed_outcome"], prediction_frame["predicted_outcome_nuisance"]),
            "r2": r2_score(prediction_frame["observed_outcome"], prediction_frame["predicted_outcome_nuisance"]),
        },
        {
            "learner": selected_name,
            "nuisance": "ml_m treatment model",
            "rmse": rmse_metric(prediction_frame["observed_treatment"], prediction_frame["predicted_treatment_nuisance"]),
            "mae": mean_absolute_error(prediction_frame["observed_treatment"], prediction_frame["predicted_treatment_nuisance"]),
            "r2": r2_score(prediction_frame["observed_treatment"], prediction_frame["predicted_treatment_nuisance"]),
        },
    ]
)

save_table(prediction_quality, f"{NOTEBOOK_PREFIX}_selected_nuisance_prediction_quality.csv")
display(prediction_quality)
learner nuisance rmse mae r2
0 LightGBM ml_l outcome model 1.849670 1.465260 0.428331
1 LightGBM ml_m treatment model 1.118639 0.884099 0.497622

The outcome nuisance is harder than the treatment nuisance in this synthetic design because the outcome contains treatment-driven variation and a nonlinear baseline component. Seeing this split helps decide where tuning effort should go.
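Why the outcome nuisance has a lower ceiling can be shown with a toy variance decomposition (illustrative functional forms, not the notebook's actual data-generating process): the parts of Y driven by treatment residual variation and outcome noise are invisible to any function of X, so even the oracle predictor E[Y | X] caps the achievable R²:

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta = 20000, 1.3
x = rng.normal(size=n)
v = rng.normal(size=n)          # treatment variation unexplained by X
d = np.sin(x) + v
y = theta * d + x ** 2 + rng.normal(size=n)

# The best possible outcome-nuisance predictor is E[Y | X] = theta*sin(x) + x^2.
# Variance from V (scaled by theta^2) and from outcome noise is unreachable.
best_ell = theta * np.sin(x) + x ** 2
r2_ceiling = float(np.var(best_ell) / np.var(y))
print("R^2 ceiling for the outcome nuisance:", round(r2_ceiling, 3))
```

An observed ml_l R² near this ceiling suggests further tuning effort is better spent elsewhere, for example on the treatment model or on feature quality.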

Nuisance Prediction Scatterplots

The scatterplots below show where the selected learner fits well and where it struggles. For a real analysis, this type of plot is not proof of causal validity, but it can reveal obvious underfitting or data-quality issues.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.scatterplot(
    data=prediction_frame.sample(700, random_state=RANDOM_SEED),
    x="observed_outcome",
    y="predicted_outcome_nuisance",
    alpha=0.45,
    s=28,
    color="#2563eb",
    ax=axes[0],
)
axes[0].plot(
    [prediction_frame["observed_outcome"].min(), prediction_frame["observed_outcome"].max()],
    [prediction_frame["observed_outcome"].min(), prediction_frame["observed_outcome"].max()],
    color="#111827",
    linestyle="--",
    linewidth=1.2,
)
axes[0].set_title("Outcome Nuisance: Observed vs Cross-Fitted Prediction")
axes[0].set_xlabel("Observed outcome")
axes[0].set_ylabel("Predicted E[Y | X]")

sns.scatterplot(
    data=prediction_frame.sample(700, random_state=RANDOM_SEED + 1),
    x="observed_treatment",
    y="predicted_treatment_nuisance",
    alpha=0.45,
    s=28,
    color="#059669",
    ax=axes[1],
)
axes[1].plot(
    [prediction_frame["observed_treatment"].min(), prediction_frame["observed_treatment"].max()],
    [prediction_frame["observed_treatment"].min(), prediction_frame["observed_treatment"].max()],
    color="#111827",
    linestyle="--",
    linewidth=1.2,
)
axes[1].set_title("Treatment Nuisance: Observed vs Cross-Fitted Prediction")
axes[1].set_xlabel("Observed treatment")
axes[1].set_ylabel("Predicted E[D | X]")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_selected_nuisance_prediction_scatter.png", bbox_inches="tight")
plt.show()

The treatment nuisance predictions are tighter because the treatment equation is easier to learn. The outcome nuisance remains noisier, which is normal when the outcome contains both treatment variation and irreducible noise.

Library-Native Hyperparameter Tuning With Optuna

DoubleML provides tune_ml_models(), which uses Optuna to tune nuisance learners. The parameter spaces are functions that receive an Optuna trial and return a parameter dictionary.

This section tunes a LightGBM learner for both ml_l and ml_m. The search is intentionally small: enough to teach the workflow, not enough to spend minutes chasing marginal improvements.

Important distinction: tuning is based on nuisance prediction scores, here negative root mean squared error. We are not asking Optuna to optimize the treatment-effect estimate.

base_lgbm_for_tuning = LGBMRegressor(
    n_estimators=120,
    random_state=RANDOM_SEED,
    verbose=-1,
    n_jobs=1,
)

def lgbm_param_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.02, 0.12, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 8, 40),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "subsample": trial.suggest_float("subsample", 0.70, 1.00),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.70, 1.00),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.10, 5.00, log=True),
    }

untuned_lgbm_model, untuned_lgbm_row = fit_plr_with_learner(
    encoded_df,
    learner_name="LightGBM untuned reference",
    ml_l=base_lgbm_for_tuning,
    ml_m=base_lgbm_for_tuning,
    n_folds=5,
    seed=RANDOM_SEED,
)

tuned_lgbm = DoubleMLPLR(
    make_plr_data(encoded_df),
    ml_l=clone(base_lgbm_for_tuning),
    ml_m=clone(base_lgbm_for_tuning),
    n_folds=5,
    draw_sample_splitting=False,
)
tuned_lgbm.set_sample_splitting(make_common_splits(encoded_df, n_folds=5, seed=RANDOM_SEED))

tune_start = time.perf_counter()
tune_result = tuned_lgbm.tune_ml_models(
    ml_param_space={"ml_l": lgbm_param_space, "ml_m": lgbm_param_space},
    scoring_methods={"ml_l": "neg_root_mean_squared_error", "ml_m": "neg_root_mean_squared_error"},
    cv=3,
    set_as_params=True,
    return_tune_res=True,
    optuna_settings={
        "n_trials": 10,
        "direction": "maximize",
        "show_progress_bar": False,
        "sampler": optuna.samplers.TPESampler(seed=RANDOM_SEED),
    },
)
tuning_runtime = time.perf_counter() - tune_start

tuned_fit_start = time.perf_counter()
tuned_lgbm.fit()
tuned_fit_runtime = time.perf_counter() - tuned_fit_start

tuned_rmse = tuned_lgbm.evaluate_learners(metric=rmse_metric)
tuned_mae = tuned_lgbm.evaluate_learners(metric=mae_metric)

tuned_lgbm_row = {
    "learner": "LightGBM tuned with tune_ml_models",
    "theta_hat": float(tuned_lgbm.coef[0]),
    "se": float(tuned_lgbm.se[0]),
    "ci_95_lower": float(tuned_lgbm.confint().iloc[0, 0]),
    "ci_95_upper": float(tuned_lgbm.confint().iloc[0, 1]),
    "bias_vs_true": float(tuned_lgbm.coef[0] - TRUE_THETA),
    "abs_bias_vs_true": float(abs(tuned_lgbm.coef[0] - TRUE_THETA)),
    "rmse_ml_l": float(tuned_rmse["ml_l"][0, 0]),
    "rmse_ml_m": float(tuned_rmse["ml_m"][0, 0]),
    "mae_ml_l": float(tuned_mae["ml_l"][0, 0]),
    "mae_ml_m": float(tuned_mae["ml_m"][0, 0]),
    "avg_nuisance_rmse": float((tuned_rmse["ml_l"][0, 0] + tuned_rmse["ml_m"][0, 0]) / 2),
    "runtime_seconds": tuned_fit_runtime,
    "tuning_runtime_seconds": tuning_runtime,
}

print(f"Tuning runtime: {tuning_runtime:.2f} seconds")
print(f"Final tuned fit runtime: {tuned_fit_runtime:.2f} seconds")
Tuning runtime: 3.72 seconds
Final tuned fit runtime: 0.68 seconds

The tuning cell separates two costs: the Optuna search time and the final DoubleML fit time. In a real workflow, the search budget should be decided before the analysis begins, especially when many learners and outcomes are being compared.
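The same discipline — a fixed, pre-declared search budget scored by cross-validated negative RMSE on the nuisance target only — does not depend on Optuna. As a hedged sketch, the snippet below reproduces the pattern with sklearn's `RandomizedSearchCV` as a stand-in for the Optuna search (this is not the DoubleML API, and the target here is a synthetic stand-in for an outcome-nuisance regression):

```python
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
n = 800
x = rng.normal(size=(n, 4))
# Stand-in nuisance target: a nonlinear regression like E[Y | X].
y = x[:, 0] ** 2 + np.sin(x[:, 1]) + rng.normal(scale=0.5, size=n)

# Pre-declared budget: 10 candidates, 3-fold CV, scored by negative RMSE
# on the nuisance target only -- never by the causal estimate.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": loguniform(0.02, 0.12),
        "max_depth": randint(2, 5),
        "subsample": [0.7, 0.85, 1.0],
    },
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(x, y)
print(search.best_params_, "best RMSE:", round(-search.best_score_, 3))
```

The key property shared with `tune_ml_models` is that the selection criterion never touches the treatment-effect estimate.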

Tuning Result Details

The returned Optuna objects contain the best parameters and the nuisance score for each learner. We flatten them into a table so the chosen hyperparameters become part of the audit trail.

tune_rows = []
for result_bundle in tune_result:
    for learner_key, result in result_bundle.items():
        tune_rows.append(
            {
                "learner_key": learner_key,
                "best_score_neg_rmse": float(result.best_score),
                "best_rmse": float(-result.best_score),
                "best_params": json.dumps(result.best_params, sort_keys=True),
            }
        )

tune_summary = pd.DataFrame(tune_rows)
tuned_comparison = pd.DataFrame([untuned_lgbm_row, tuned_lgbm_row])

save_table(tune_summary, f"{NOTEBOOK_PREFIX}_optuna_tune_summary.csv")
save_table(tuned_comparison, f"{NOTEBOOK_PREFIX}_tuned_vs_untuned_lgbm.csv")

display(tune_summary)
display(tuned_comparison)
learner_key best_score_neg_rmse best_rmse best_params
0 ml_l -1.851373 1.851373 {"colsample_bytree": 0.9896896099223678, "lear...
1 ml_m -1.124470 1.124470 {"colsample_bytree": 0.7094287557060203, "lear...
learner theta_hat se ci_95_lower ci_95_upper bias_vs_true abs_bias_vs_true rmse_ml_l rmse_ml_m mae_ml_l mae_ml_m avg_nuisance_rmse runtime_seconds tuning_runtime_seconds
0 LightGBM untuned reference 1.219688 0.029935 1.161017 1.278359 -0.080312 0.080312 1.858091 1.135388 1.479899 0.896752 1.49674 1.05248 NaN
1 LightGBM tuned with tune_ml_models 1.247427 0.029360 1.189881 1.304972 -0.052573 0.052573 1.825092 1.116527 1.444504 0.878563 1.47081 0.67705 3.717984

The tuned model may or may not dominate the untuned reference on every metric. That is normal. Tuning is a disciplined way to search the learner space, not a guarantee that every finite-sample estimate improves.

Tuned Versus Untuned Plot

This plot compares the LightGBM reference with the tuned LightGBM model. The useful question is not simply which point is closer to the true effect in this simulation. The useful question is whether tuning improves nuisance quality enough to justify the extra complexity and runtime.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

plot_tuned = tuned_comparison.copy()
axes[0].errorbar(
    x=plot_tuned["theta_hat"],
    y=plot_tuned["learner"],
    xerr=1.96 * plot_tuned["se"],
    fmt="o",
    color="#111827",
    ecolor="#475569",
    capsize=4,
)
axes[0].axvline(TRUE_THETA, color="#dc2626", linestyle="--", linewidth=1.5, label="True effect")
axes[0].set_title("Effect Estimates")
axes[0].set_xlabel("Estimated treatment effect")
axes[0].set_ylabel("")
axes[0].legend(loc="best")

bar_df = plot_tuned.melt(
    id_vars="learner",
    value_vars=["rmse_ml_l", "rmse_ml_m"],
    var_name="nuisance",
    value_name="rmse",
)
sns.barplot(data=bar_df, x="rmse", y="learner", hue="nuisance", ax=axes[1], palette=["#2563eb", "#059669"])
axes[1].set_title("Cross-Fitted Nuisance RMSE")
axes[1].set_xlabel("RMSE")
axes[1].set_ylabel("")
axes[1].legend(title="Nuisance")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_tuned_vs_untuned_lgbm.png", bbox_inches="tight")
plt.show()

The side-by-side view keeps causal and predictive diagnostics together. When tuning changes the causal estimate, the next question should be whether the nuisance predictions, overlap, residuals, or sensitivity checks explain why.

Honest Learner Selection Split

When many learner families are compared, there is a risk of choosing a learner simply because its causal estimate looks convenient. One practical discipline is to use an exploration split to choose a learner by nuisance criteria, then run the final estimate on a separate estimation split.

This is not mandatory for every DoubleML analysis, and it costs sample size. But it is a useful teaching device for separating model selection from final effect reporting.

exploration_df, estimation_df = train_test_split(
    encoded_df,
    test_size=0.60,
    random_state=RANDOM_SEED,
    shuffle=True,
)

honest_candidate_names = ["LassoCV pipeline", "HistGradientBoosting", "LightGBM", "XGBoost"]
honest_rows = []
for learner_name in honest_candidate_names:
    _, row = fit_plr_with_learner(
        exploration_df.reset_index(drop=True),
        learner_name=learner_name,
        ml_l=learner_registry[learner_name],
        ml_m=learner_registry[learner_name],
        n_folds=3,
        seed=RANDOM_SEED,
    )
    honest_rows.append(row)

exploration_selection = pd.DataFrame(honest_rows).sort_values("avg_nuisance_rmse").reset_index(drop=True)
selected_honest_learner = exploration_selection.loc[0, "learner"]

final_honest_model, final_honest_row = fit_plr_with_learner(
    estimation_df.reset_index(drop=True),
    learner_name=f"Honest selected: {selected_honest_learner}",
    ml_l=learner_registry[selected_honest_learner],
    ml_m=learner_registry[selected_honest_learner],
    n_folds=5,
    seed=RANDOM_SEED + 7,
)

honest_final_table = pd.DataFrame([final_honest_row])

save_table(exploration_selection, f"{NOTEBOOK_PREFIX}_honest_exploration_selection.csv")
save_table(honest_final_table, f"{NOTEBOOK_PREFIX}_honest_final_estimate.csv")

display(exploration_selection)
display(honest_final_table)
learner theta_hat se ci_95_lower ci_95_upper bias_vs_true abs_bias_vs_true rmse_ml_l rmse_ml_m mae_ml_l mae_ml_m avg_nuisance_rmse runtime_seconds
0 XGBoost 1.313994 0.039975 1.235644 1.392344 0.013994 0.013994 1.906027 1.151033 1.506946 0.913160 1.528530 0.314751
1 LightGBM 1.276176 0.040285 1.197220 1.355133 -0.023824 0.023824 1.936377 1.195060 1.528350 0.950518 1.565718 0.151945
2 HistGradientBoosting 1.254986 0.043235 1.170248 1.339725 -0.045014 0.045014 1.959175 1.207533 1.547991 0.965919 1.583354 0.946530
3 LassoCV pipeline 1.337713 0.037976 1.263281 1.412144 0.037713 0.037713 1.990969 1.204897 1.586644 0.955102 1.597933 0.237831
learner theta_hat se ci_95_lower ci_95_upper bias_vs_true abs_bias_vs_true rmse_ml_l rmse_ml_m mae_ml_l mae_ml_m avg_nuisance_rmse runtime_seconds
0 Honest selected: XGBoost 1.303025 0.038108 1.228335 1.377714 0.003025 0.003025 1.918942 1.148045 1.518586 0.897994 1.533493 0.582215

The exploration split chooses a learner using nuisance RMSE, then the estimation split reports the treatment effect. This is more conservative than selecting the final model after browsing all causal estimates.

Unsafe Tuning Patterns

The next table is not code for fitting a model. It is a checklist of habits to avoid. These mistakes are common because they feel like ordinary machine-learning iteration, but they weaken the credibility of causal reporting.

unsafe_patterns = pd.DataFrame(
    [
        {
            "pattern": "Choose the learner whose theta is closest to a preferred answer",
            "why_it_is_unsafe": "The final estimate becomes part of the selection rule, so uncertainty is understated and the design is no longer pre-specified.",
            "safer_alternative": "Choose learners using nuisance prediction, stability rules, or an honest exploration split.",
        },
        {
            "pattern": "Tune on the full analysis until confidence intervals become significant",
            "why_it_is_unsafe": "Repeated searching changes the meaning of the reported interval.",
            "safer_alternative": "Fix the tuning budget and report sensitivity to reasonable alternatives.",
        },
        {
            "pattern": "Preprocess using outcome information outside a fold",
            "why_it_is_unsafe": "Information from held-out rows can leak into nuisance predictions.",
            "safer_alternative": "Put preprocessing inside sklearn pipelines or build reproducible pre-treatment features only.",
        },
        {
            "pattern": "Report only the best-looking learner",
            "why_it_is_unsafe": "Readers cannot see whether the result is stable across defensible choices.",
            "safer_alternative": "Show a compact learner sensitivity table and explain the chosen primary specification.",
        },
    ]
)

save_table(unsafe_patterns, f"{NOTEBOOK_PREFIX}_unsafe_tuning_patterns.csv")
display(unsafe_patterns)
pattern why_it_is_unsafe safer_alternative
0 Choose the learner whose theta is closest to a... The final estimate becomes part of the selecti... Choose learners using nuisance prediction, sta...
1 Tune on the full analysis until confidence int... Repeated searching changes the meaning of the ... Fix the tuning budget and report sensitivity t...
2 Preprocess using outcome information outside a... Information from held-out rows can leak into n... Put preprocessing inside sklearn pipelines or ...
3 Report only the best-looking learner Readers cannot see whether the result is stabl... Show a compact learner sensitivity table and e...

The safe alternative is not to avoid tuning. The safe alternative is to make tuning auditable: define the search space, define the score, control the search budget, and show robustness to plausible learner choices.
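The first unsafe pattern can be quantified with a toy simulation (the parameters here are purely illustrative): even when every candidate learner produces an unbiased estimate, reporting whichever estimate lies closest to a preferred answer yields a biased report, while pre-specifying a learner does not:

```python
import numpy as np

rng = np.random.default_rng(3)
true_theta, preferred, k, n_sims = 1.3, 1.5, 8, 2000

# Each row: k learner estimates, all unbiased around the truth with se = 0.05.
draws = true_theta + 0.05 * rng.normal(size=(n_sims, k))

# Unsafe rule: report whichever estimate is closest to a preferred answer.
shopped = draws[np.arange(n_sims), np.abs(draws - preferred).argmin(axis=1)]
# Safe rule: pre-specify one learner before seeing any estimates.
prespecified = draws[:, 0]

print("estimate-shopping bias:", round(float(shopped.mean() - true_theta), 4))
print("pre-specified bias:", round(float(prespecified.mean() - true_theta), 4))
```

The shopped estimates inherit a systematic pull toward the preferred value, and the reported standard errors do not account for the selection step.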

Reporting Checklist

A good learner-tuning section in a causal notebook should be short but complete. The table below is a reusable reporting checklist.

reporting_checklist = pd.DataFrame(
    [
        {"item": "State the causal estimand before learner selection", "status": "required"},
        {"item": "List the nuisance learners and their targets", "status": "required"},
        {"item": "Describe preprocessing and confirm controls are pre-treatment", "status": "required"},
        {"item": "Report cross-fitting folds and repeated-split choices", "status": "required"},
        {"item": "Show nuisance prediction diagnostics", "status": "recommended"},
        {"item": "Show effect stability across a small set of defensible learners", "status": "recommended"},
        {"item": "Document tuning search spaces, scores, and trial budgets", "status": "required if tuned"},
        {"item": "Avoid choosing models based on desired treatment-effect estimates", "status": "required"},
    ]
)

save_table(reporting_checklist, f"{NOTEBOOK_PREFIX}_learner_tuning_reporting_checklist.csv")
display(reporting_checklist)
item status
0 State the causal estimand before learner selec... required
1 List the nuisance learners and their targets required
2 Describe preprocessing and confirm controls ar... required
3 Report cross-fitting folds and repeated-split ... required
4 Show nuisance prediction diagnostics recommended
5 Show effect stability across a small set of de... recommended
6 Document tuning search spaces, scores, and tri... required if tuned
7 Avoid choosing models based on desired treatme... required

This checklist is intentionally practical. It turns learner choice from a hidden modeling preference into a reproducible part of the causal analysis.

Report Template

The final cell writes a short report template. The template is not meant to replace analysis; it is meant to keep the final write-up disciplined and consistent.

report_text = f"""
# Learner And Tuning Report Template

## Estimand
- Treatment:
- Outcome:
- Control set:
- Target population:
- Identification assumptions:

## Nuisance Learners
- Primary `ml_l` learner:
- Primary `ml_m` learner:
- Reason for primary learner choice:
- Alternative learners checked:

## Preprocessing
- Numeric controls:
- Encoded categorical controls:
- Excluded columns:
- Pipeline steps:

## Cross-Fitting And Tuning
- Number of folds:
- Repeated sample splitting:
- Tuning method:
- Tuning score:
- Search budget:
- Search space summary:

## Diagnostics
- Outcome nuisance RMSE:
- Treatment nuisance RMSE:
- Estimate stability across learners:
- Runtime considerations:

## Final Estimate
- Point estimate:
- Standard error:
- Confidence interval:
- Main caveats:
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_learner_tuning_report_template.md"
report_path.write_text(report_text)

artifact_manifest = pd.DataFrame(
    [
        {"artifact": "raw synthetic data", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_raw_synthetic_plr_tuning_data.csv")},
        {"artifact": "encoded synthetic data", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_encoded_synthetic_plr_tuning_data.csv")},
        {"artifact": "learner comparison", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_learner_comparison.csv")},
        {"artifact": "Optuna tuning summary", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_optuna_tune_summary.csv")},
        {"artifact": "report template", "path": str(report_path)},
        {"artifact": "workflow figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_safe_tuning_workflow.png")},
        {"artifact": "estimate comparison figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_learner_estimate_comparison.png")},
    ]
)

save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(Markdown(f"Report template written to `{report_path}`"))
display(artifact_manifest)

Report template written to /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/10_learner_tuning_report_template.md

artifact path
0 raw synthetic data /home/apex/Documents/ranking_sys/notebooks/tut...
1 encoded synthetic data /home/apex/Documents/ranking_sys/notebooks/tut...
2 learner comparison /home/apex/Documents/ranking_sys/notebooks/tut...
3 Optuna tuning summary /home/apex/Documents/ranking_sys/notebooks/tut...
4 report template /home/apex/Documents/ranking_sys/notebooks/tut...
5 workflow figure /home/apex/Documents/ranking_sys/notebooks/tut...
6 estimate comparison figure /home/apex/Documents/ranking_sys/notebooks/tut...

The notebook now has all pieces needed for a learner-tuning tutorial: a known synthetic target, learner-family comparison, nuisance diagnostics, library-native tuning, an honest-selection pattern, and reporting artifacts.

What Comes Next

The natural next topic is sample splitting itself: how cross-fitting works, what repeated cross-fitting changes, how to inspect split stability, and when repeated sample splits are worth the extra runtime.