DoubleML Tutorial 08: Sample Selection Models

This notebook teaches sample-selection models with DoubleML. The central problem is outcome attrition: we want a causal effect for a target population, but the outcome is observed only for a selected subset of rows.

This comes up constantly in applied data work. A user may only leave a rating if they engaged. A customer satisfaction survey may only be answered by a nonrandom subset. A downstream business outcome may only be measured after a user completes an earlier funnel step. If we estimate treatment effects only on the selected rows, the result can describe the selected sample rather than the full target population.

DoubleMLSSM handles this setting by combining treatment-propensity learning, selection-probability learning, and outcome regression. Machine learning makes the nuisance estimation flexible; the causal validity still comes from assumptions about treatment assignment and outcome observation.

Learning Goals

By the end of this notebook, you should be able to:

  • identify when selected outcomes threaten causal effect estimation;
  • distinguish treatment selection from sample selection;
  • explain missing-at-random selection and nonignorable selection;
  • build DoubleMLSSMData with outcome, treatment, covariates, selection indicator, and optional selection instrument;
  • fit DoubleMLSSM with linear and tree-based nuisance learners;
  • inspect outcome, treatment, and selection nuisance predictions;
  • understand why unselected rows must remain in the dataset;
  • report selection assumptions, overlap concerns, and sensitivity risks clearly.

The Selection Problem

A standard binary-treatment ATE problem has an outcome Y, treatment D, and covariates X. Sample selection adds another variable, S:

S = 1 if the outcome is observed
S = 0 if the outcome is missing or not measured

The dangerous shortcut is to drop rows with S = 0 and run a treatment-effect model on selected rows only. That selected-row analysis answers a different question unless selection is completely unrelated to potential outcomes and treatment effects.

Sample selection is not the same as treatment selection. Treatment selection asks why D differs across rows. Sample selection asks why Y is observed for some rows and not others. In selected-outcome problems, we often need nuisance models for both.

Missing-At-Random Selection

The missing-at-random version assumes that, after conditioning on treatment and observed covariates, whether the outcome is observed does not depend on the missing potential outcomes.

In words:

Outcome observation can depend on D and X,
but not on hidden outcome shocks once D and X are fixed.

This is a strong assumption, but it can be plausible in some settings. For example, if survey response depends on known user activity, tenure, device type, and treatment status, and those variables capture the relevant response process, a missing-at-random correction may be credible.

The key nuisance functions are:

  • g(d, X) = E[Y | S = 1, D = d, X], the outcome model among selected rows;
  • pi(D, X) = P(S = 1 | D, X), the selection probability;
  • m(X) = P(D = 1 | X), the treatment propensity.

DoubleML combines these through an orthogonal score, so that small errors in the nuisance estimates have no first-order impact on the final treatment-effect estimate.

Nonignorable Selection

Sometimes outcome observation depends on hidden factors that also affect the outcome. This is nonignorable selection. A satisfaction survey may be more likely to be answered by unusually happy or unusually unhappy users even after conditioning on observed covariates. In that case, missing-at-random adjustment can still be biased.

DoubleMLSSM also supports a nonignorable score. That setup requires an instrument-like variable for selection: a variable that affects whether the outcome is observed but does not directly affect the outcome itself. In this notebook we call it selection_encouragement.

The exclusion restriction is doing real causal work. The package can use the variable, but it cannot prove that the variable affects selection only through observation and not through the outcome.

What DoubleML Adds

The sample-selection score is a doubly robust style construction. For the missing-at-random case, the treatment arm contribution looks like:

D * S * (Y - g(1, X)) / [m(X) * pi(1, X)] + g(1, X)

The control arm has the analogous term with 1 - D and g(0, X). The ATE estimate is the average treatment contribution minus the average control contribution.

This structure explains the three nuisance roles:

  • the outcome model fills in expected outcomes;
  • the treatment propensity adjusts for treatment assignment;
  • the selection probability upweights observed outcomes to represent rows whose outcomes are missing.

Cross-fitting keeps nuisance predictions out of sample. Orthogonality makes the final estimate less sensitive to small nuisance-model errors, but it does not remove the need for selection assumptions and overlap.
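Before the full implementation later in the notebook, a toy one-row check makes the treated-arm term concrete. Every number below is made up for illustration.

# Toy one-row check of the treated-arm score term (all numbers made up).
toy_y = 5.0      # observed outcome on a selected treated row (D = 1, S = 1)
toy_g1 = 4.2     # hypothetical outcome prediction g(1, X)
toy_m = 0.6      # hypothetical treatment propensity m(X)
toy_pi = 0.5     # hypothetical selection probability pi(1, X)

treated_term = 1 * 1 * (toy_y - toy_g1) / (toy_m * toy_pi) + toy_g1
print(treated_term)  # 4.2 + 0.8 / 0.3 = 6.8667

# On an unselected row (S = 0) the correction term vanishes and the
# contribution is just g(1, X): the outcome model fills in the missing outcome.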

Runtime Note

This notebook fits several sample-selection models, including stress tests and a nonignorable-selection example. A full run should take about 2 to 4 minutes on a typical laptop.

Setup

The setup cell imports the scientific Python stack, configures output folders, and suppresses only the narrow warnings that would clutter tutorial output. Code is kept visible throughout the notebook.

from pathlib import Path
import os
import sys
import warnings

PROJECT_ROOT = Path.cwd().resolve()
for candidate in [PROJECT_ROOT, *PROJECT_ROOT.parents]:
    if (candidate / "pyproject.toml").exists():
        PROJECT_ROOT = candidate
        break

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*", category=Warning)
warnings.filterwarnings("ignore", message=".*trimming.*deprecated.*", category=DeprecationWarning)

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

import sklearn
from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import statsmodels.formula.api as smf

import doubleml as dml
from doubleml import DoubleMLSSM, DoubleMLSSMData

NOTEBOOK_PREFIX = "08"
RANDOM_SEED = 42
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams.update({"figure.dpi": 120, "savefig.dpi": 160})

package_versions = pd.DataFrame(
    [
        {"package": "python", "version": os.sys.version.split()[0]},
        {"package": "doubleml", "version": dml.__version__},
        {"package": "numpy", "version": np.__version__},
        {"package": "pandas", "version": pd.__version__},
        {"package": "scikit-learn", "version": sklearn.__version__},
        {"package": "matplotlib", "version": plt.matplotlib.__version__},
        {"package": "seaborn", "version": sns.__version__},
    ]
)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 python 3.13.12
1 doubleml 0.11.2
2 numpy 2.4.4
3 pandas 3.0.2
4 scikit-learn 1.6.1
5 matplotlib 3.10.9
6 seaborn 0.13.2

The version table is part of the reproducibility record. The installed version exposes DoubleMLSSM and DoubleMLSSMData, which are the two classes used in this tutorial.

Helper Functions

The helpers below keep the notebook readable. They save tables, construct nuisance learners, extract DoubleML predictions, and compute compact diagnostics. None of them changes the causal estimand.

def save_table(df, file_name, index=False):
    """Save a table under the notebook output folder and return the DataFrame."""
    path = TABLE_DIR / file_name
    df.to_csv(path, index=index)
    return df


def sigmoid(x):
    """Logistic transform used for synthetic probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


def rmse(y_true, y_pred):
    """Root mean squared error with one-dimensional arrays."""
    return float(np.sqrt(mean_squared_error(np.asarray(y_true).ravel(), np.asarray(y_pred).ravel())))


def clip_probability(values, lower=0.01, upper=0.99):
    """Clip probabilities away from exact 0 and 1 for stable inverse weights."""
    return np.clip(np.asarray(values, dtype=float), lower, upper)


def predict_probability(model, X):
    """Return P(class=1) from a classifier."""
    return model.predict_proba(X)[:, 1]


def make_linear_learners():
    """Low-variance nuisance learners for mostly linear synthetic designs."""
    ml_g = make_pipeline(StandardScaler(), LinearRegression())
    ml_pi = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
    ml_m = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
    return ml_g, ml_pi, ml_m


def make_hgb_learners(seed=RANDOM_SEED, max_iter=80):
    """Tree-based nuisance learners for more flexible adjustment."""
    ml_g = HistGradientBoostingRegressor(
        max_iter=max_iter,
        max_leaf_nodes=15,
        min_samples_leaf=30,
        learning_rate=0.06,
        random_state=seed,
    )
    ml_pi = HistGradientBoostingClassifier(
        max_iter=max_iter,
        max_leaf_nodes=15,
        min_samples_leaf=30,
        learning_rate=0.06,
        random_state=seed + 1,
    )
    ml_m = HistGradientBoostingClassifier(
        max_iter=max_iter,
        max_leaf_nodes=15,
        min_samples_leaf=30,
        learning_rate=0.06,
        random_state=seed + 2,
    )
    return ml_g, ml_pi, ml_m


def prediction_vector(model, key):
    """Extract one DoubleML prediction array as a flat vector."""
    return np.asarray(model.predictions[key]).reshape(-1)


def summarize_doubleml(model, label, design, true_target):
    """Create a compact one-row summary from a fitted DoubleML object."""
    ci = model.confint(level=0.95).iloc[0]
    theta = float(model.coef[0])
    se = float(model.se[0])
    return {
        "estimator": label,
        "design": design,
        "theta_hat": theta,
        "std_error": se,
        "ci_95_lower": float(ci.iloc[0]),
        "ci_95_upper": float(ci.iloc[1]),
        "true_target": float(true_target),
        "bias_vs_target": theta - float(true_target),
    }


def learner_loss_table(model, model_label):
    """Convert DoubleML learner losses to a tidy table."""
    losses = model.evaluate_learners()
    rows = []
    for learner_name, values in losses.items():
        rows.append(
            {
                "model": model_label,
                "learner": learner_name,
                "metric_value": float(np.asarray(values).mean()),
                "metric": "RMSE for outcome learners; classification error style score for propensity learners",
            }
        )
    return pd.DataFrame(rows)


def add_ci_columns(df):
    """Prepare lower and upper error-bar columns for plotting."""
    out = df.copy()
    out["lower_error"] = out["theta_hat"] - out["ci_95_lower"]
    out["upper_error"] = out["ci_95_upper"] - out["theta_hat"]
    return out

The helper functions make the model code easier to read. The key idea to keep in mind is that SSM has three nuisance tasks: outcome regression, selection-probability modeling, and treatment-propensity modeling.

Teaching Diagram

The diagram separates treatment assignment from outcome observation. D determines which potential outcome is relevant. S determines whether we see that outcome. Both can depend on observed controls, and selection can also depend on treatment.

from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

nodes = {
    "X": {"xy": (0.10, 0.54), "label": "Observed\ncontrols X", "color": "#dbeafe"},
    "D": {"xy": (0.34, 0.74), "label": "Treatment\nD", "color": "#dcfce7"},
    "S": {"xy": (0.34, 0.30), "label": "Outcome\nobserved S", "color": "#fef3c7"},
    "Y": {"xy": (0.64, 0.58), "label": "Outcome\nY", "color": "#fee2e2"},
    "Z": {"xy": (0.62, 0.22), "label": "Selection\nencouragement Z", "color": "#e0f2fe"},
    "U": {"xy": (0.60, 0.88), "label": "Hidden response\nfactor U", "color": "#f3f4f6"},
    "O": {"xy": (0.90, 0.52), "label": "Observed data\nS x Y", "color": "#ede9fe"},
}

fig, ax = plt.subplots(figsize=(12, 6))
ax.set_axis_off()
box_w, box_h = 0.15, 0.11
arrow_gap = 0.018


def anchor(node, side):
    x, y = nodes[node]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
        "upper_right": (box_w / 2, box_h * 0.25),
        "lower_right": (box_w / 2, -box_h * 0.25),
        "upper_left": (-box_w / 2, box_h * 0.25),
        "lower_left": (-box_w / 2, -box_h * 0.25),
    }
    dx, dy = offsets[side]
    return (x + dx, y + dy)


def shorten_segment(start, end, gap=arrow_gap):
    """Move arrow endpoints inward so arrowheads do not sit on top of boxes."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    delta = end - start
    length = np.hypot(delta[0], delta[1])
    if length == 0:
        return tuple(start), tuple(end)
    unit = delta / length
    return tuple(start + gap * unit), tuple(end - gap * unit)


def shorten_polyline(points, gap=arrow_gap):
    """Shorten only the first and final endpoints of a routed arrow."""
    pts = [tuple(point) for point in points]
    if len(pts) < 2:
        return pts
    pts[0], _ = shorten_segment(pts[0], pts[1], gap=gap)
    _, pts[-1] = shorten_segment(pts[-2], pts[-1], gap=gap)
    return pts


def draw_arrow(start, end, color, style="solid", rad=0.0, linewidth=1.7, gap=arrow_gap):
    start, end = shorten_segment(start, end, gap=gap)
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=linewidth,
        color=color,
        linestyle=style,
        shrinkA=0,
        shrinkB=0,
        connectionstyle=f"arc3,rad={rad}",
        zorder=5,
    )
    ax.add_patch(arrow)


def draw_routed_arrow(points, color, style="solid", linewidth=1.7, gap=arrow_gap):
    pts = shorten_polyline(points, gap=gap)
    for start, end in zip(pts[:-2], pts[1:-1]):
        ax.plot([start[0], end[0]], [start[1], end[1]], color=color, linestyle=style, linewidth=linewidth, zorder=5)
    draw_arrow(pts[-2], pts[-1], color=color, style=style, linewidth=linewidth, gap=0.0)

# Main observed paths.
draw_arrow(anchor("X", "upper_right"), anchor("D", "left"), color="#334155")
draw_arrow(anchor("X", "lower_right"), anchor("S", "left"), color="#334155")
draw_routed_arrow([anchor("X", "right"), (0.38, 0.56), anchor("Y", "left")], color="#334155")
draw_arrow(anchor("D", "right"), anchor("Y", "upper_left"), color="#15803d")
draw_arrow(anchor("D", "bottom"), anchor("S", "top"), color="#b45309")
draw_arrow(anchor("Z", "left"), anchor("S", "lower_right"), color="#0369a1")
draw_arrow(anchor("Y", "right"), anchor("O", "upper_left"), color="#334155")
draw_arrow(anchor("S", "right"), anchor("O", "lower_left"), color="#b45309")

# Dashed paths are separated so the nonignorable-selection risk is visible without crossing boxes.
draw_arrow(anchor("U", "lower_right"), anchor("Y", "top"), color="#6b7280", style="dashed", linewidth=1.5)
draw_routed_arrow([anchor("U", "lower_left"), (0.48, 0.66), anchor("S", "upper_right")], color="#6b7280", style="dashed", linewidth=1.5)

for spec in nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

ax.text(
    0.50,
    0.08,
    "SSM estimates a treatment effect when Y is observed only for selected rows. The selection assumptions are part of the design.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Sample Selection Model Teaching Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_ssm_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The dashed paths show the nonignorable-selection risk. If a hidden response factor affects both outcome observation and the outcome itself, a missing-at-random correction can fail. The selection encouragement Z is useful only if it shifts observation without directly shifting the outcome.

Synthetic Missing-At-Random Data

The first dataset is a missing-at-random design. Selection depends on treatment and observed covariates, including a selection encouragement signal. There is no hidden response factor that directly affects both selection and the outcome.

Because this is synthetic data, we keep a full_outcome column for evaluation. In real selected-outcome data, that full outcome would not be available for unselected rows.

rng = np.random.default_rng(RANDOM_SEED)
N_MAR = 6_000

engagement_score = rng.normal(0, 1, N_MAR)
baseline_value = rng.normal(0, 1, N_MAR)
support_need = rng.normal(0, 1, N_MAR)
prior_usage = rng.normal(0, 1, N_MAR)
mobile_user = rng.binomial(1, 0.48, N_MAR)
selection_encouragement = rng.normal(0, 1, N_MAR)

mar_x = pd.DataFrame(
    {
        "engagement_score": engagement_score,
        "baseline_value": baseline_value,
        "support_need": support_need,
        "prior_usage": prior_usage,
        "mobile_user": mobile_user,
        "selection_encouragement": selection_encouragement,
    }
)
feature_cols_mar = list(mar_x.columns)

true_treatment_propensity = sigmoid(
    -0.10
    + 0.55 * engagement_score
    + 0.35 * baseline_value
    - 0.30 * support_need
    + 0.18 * mobile_user
)
treated = rng.binomial(1, true_treatment_propensity)

true_tau = 1.00 + 0.25 * engagement_score + 0.15 * baseline_value - 0.10 * support_need + 0.08 * mobile_user
mu0 = 3.00 + 0.80 * engagement_score + 0.50 * baseline_value - 0.30 * support_need + 0.25 * prior_usage + 0.20 * mobile_user
full_outcome = mu0 + true_tau * treated + rng.normal(0, 0.70, N_MAR)

true_selection_probability = sigmoid(
    -0.25
    + 0.55 * treated
    + 0.45 * engagement_score
    - 0.55 * support_need
    + 0.25 * prior_usage
    + 0.70 * selection_encouragement
)
selected = rng.binomial(1, true_selection_probability)

# DoubleMLSSM expects a numeric outcome column for every row. The selection indicator tells the estimator
# which outcomes are genuinely observed and should enter the selected-outcome regressions.
outcome_observed = full_outcome.copy()
outcome_observed[selected == 0] = 0.0

mar_df = pd.concat(
    [
        pd.DataFrame({"row_id": np.arange(N_MAR)}),
        mar_x,
        pd.DataFrame(
            {
                "treated": treated,
                "selected": selected,
                "outcome_observed": outcome_observed,
                "full_outcome": full_outcome,
                "true_tau": true_tau,
                "true_treatment_propensity": true_treatment_propensity,
                "true_selection_probability": true_selection_probability,
            }
        ),
    ],
    axis=1,
)

TRUE_MAR_ATE = mar_df["true_tau"].mean()
TRUE_MAR_ATT = mar_df.loc[mar_df["treated"] == 1, "true_tau"].mean()

mar_path = DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_mar_selection_data.csv"
mar_df.to_csv(mar_path, index=False)

print(f"Rows: {len(mar_df):,}")
print(f"Selected share: {mar_df['selected'].mean():.3f}")
print(f"True MAR ATE: {TRUE_MAR_ATE:.4f}")
print(f"True MAR ATT: {TRUE_MAR_ATT:.4f}")
print(f"Saved data to: {mar_path.relative_to(PROJECT_ROOT)}")
display(mar_df.head())
Rows: 6,000
Selected share: 0.518
True MAR ATE: 1.0323
True MAR ATT: 1.1265
Saved data to: notebooks/tutorials/doubleml/outputs/datasets/08_synthetic_mar_selection_data.csv
row_id engagement_score baseline_value support_need prior_usage mobile_user selection_encouragement treated selected outcome_observed full_outcome true_tau true_treatment_propensity true_selection_probability
0 0 0.304717 0.225117 -0.309788 0.636265 0 0.178007 1 0 0.000000 3.855320 1.140926 0.559546 0.709133
1 1 -1.039984 -0.053343 0.060236 1.579594 1 0.177642 1 1 2.795688 2.795688 0.805979 0.370813 0.578867
2 2 0.750451 0.864044 -1.240756 -0.668791 1 0.985953 1 1 6.691277 6.691277 1.521295 0.762673 0.863314
3 3 0.940565 0.606733 -1.387754 -0.545453 0 -0.306807 1 1 6.255977 6.255977 1.464927 0.740005 0.756832
4 4 -1.951035 -2.985823 0.182298 -0.208867 1 0.442759 1 0 0.000000 1.885728 0.126138 0.109799 0.396394

The selected share is neither tiny nor near one, so the notebook has enough observed and unobserved rows to show the selection problem. The target for DoubleMLSSM is the ATE over the full population, not just the selected rows.

Field Dictionary

A field dictionary is especially important for selected-outcome problems because it prevents a common mistake: treating the selection indicator as an ordinary covariate or dropping unselected rows before the selection model is built.

mar_field_dictionary = pd.DataFrame(
    [
        {"field": "row_id", "role": "identifier", "description": "Unique row identifier."},
        {"field": "treated", "role": "treatment", "description": "Binary treatment indicator D."},
        {"field": "selected", "role": "selection indicator", "description": "Equals 1 if the outcome is observed."},
        {"field": "outcome_observed", "role": "observed outcome", "description": "Outcome column passed to DoubleMLSSM; unselected rows are filled with 0 and marked by selected=0."},
        {"field": "full_outcome", "role": "simulation truth", "description": "Outcome for every row, kept only because this is synthetic data."},
        {"field": "engagement_score", "role": "covariate", "description": "Observed pre-treatment feature related to treatment, selection, and outcome."},
        {"field": "baseline_value", "role": "covariate", "description": "Observed baseline outcome predictor."},
        {"field": "support_need", "role": "covariate", "description": "Observed need/risk feature."},
        {"field": "prior_usage", "role": "covariate", "description": "Observed prior activity signal."},
        {"field": "mobile_user", "role": "covariate", "description": "Observed binary segment marker."},
        {"field": "selection_encouragement", "role": "covariate in MAR design", "description": "Observed signal that predicts outcome observation."},
        {"field": "true_tau", "role": "simulation truth", "description": "Individual treatment effect."},
        {"field": "true_treatment_propensity", "role": "simulation truth", "description": "True P(D=1|X)."},
        {"field": "true_selection_probability", "role": "simulation truth", "description": "True P(S=1|D,X)."},
    ]
)
save_table(mar_field_dictionary, f"{NOTEBOOK_PREFIX}_mar_field_dictionary.csv")
display(mar_field_dictionary)
field role description
0 row_id identifier Unique row identifier.
1 treated treatment Binary treatment indicator D.
2 selected selection indicator Equals 1 if the outcome is observed.
3 outcome_observed observed outcome Outcome column passed to DoubleMLSSM; unselect...
4 full_outcome simulation truth Outcome for every row, kept only because this ...
5 engagement_score covariate Observed pre-treatment feature related to trea...
6 baseline_value covariate Observed baseline outcome predictor.
7 support_need covariate Observed need/risk feature.
8 prior_usage covariate Observed prior activity signal.
9 mobile_user covariate Observed binary segment marker.
10 selection_encouragement covariate in MAR design Observed signal that predicts outcome observat...
11 true_tau simulation truth Individual treatment effect.
12 true_treatment_propensity simulation truth True P(D=1|X).
13 true_selection_probability simulation truth True P(S=1|D,X).

Notice that selected is not part of the ordinary feature list. It has a special role through s_col, because it determines which outcomes are observed.

Data Audit

Before fitting a sample-selection model, check selection rate, treatment rate, selected-by-treatment cells, and missingness. The selection cells matter because outcome models are trained only where S = 1 within treatment arms.

mar_audit = pd.DataFrame(
    [
        {"metric": "rows", "value": len(mar_df)},
        {"metric": "columns", "value": mar_df.shape[1]},
        {"metric": "missing_cells", "value": int(mar_df.isna().sum().sum())},
        {"metric": "treated_share", "value": mar_df["treated"].mean()},
        {"metric": "selected_share", "value": mar_df["selected"].mean()},
        {"metric": "selected_share_treated", "value": mar_df.loc[mar_df["treated"] == 1, "selected"].mean()},
        {"metric": "selected_share_control", "value": mar_df.loc[mar_df["treated"] == 0, "selected"].mean()},
        {"metric": "true_ate", "value": TRUE_MAR_ATE},
    ]
)
mar_cell_counts = mar_df.groupby(["treated", "selected"]).size().rename("n").reset_index()
save_table(mar_audit, f"{NOTEBOOK_PREFIX}_mar_data_audit.csv")
save_table(mar_cell_counts, f"{NOTEBOOK_PREFIX}_mar_treatment_selection_cells.csv")
display(mar_audit)
display(mar_cell_counts)
metric value
0 rows 6000.000000
1 columns 14.000000
2 missing_cells 0.000000
3 treated_share 0.492167
4 selected_share 0.518333
5 selected_share_treated 0.617338
6 selected_share_control 0.422383
7 true_ate 1.032326
treated selected n
0 0 0 1760
1 0 1 1287
2 1 0 1130
3 1 1 1823

Every treatment-selection cell has enough rows. If selected treated rows or selected control rows were rare, the outcome nuisance functions g(1, X) or g(0, X) would be weak.

Selection Rates And Outcome Visibility

The next plot shows two different facts at once: selected shares differ by treatment status, and observed selected outcomes differ from the full outcome distribution. In real data, the full distribution would be hidden, which is exactly why the design assumption matters.

selection_rate = (
    mar_df.groupby("treated")
    .agg(selected_rate=("selected", "mean"), n=("selected", "size"))
    .reset_index()
)

plot_outcomes = pd.concat(
    [
        mar_df.assign(outcome_source="full population", outcome_for_plot=mar_df["full_outcome"]),
        mar_df.loc[mar_df["selected"] == 1].assign(
            outcome_source="selected rows only",
            outcome_for_plot=mar_df.loc[mar_df["selected"] == 1, "outcome_observed"],
        ),
    ],
    ignore_index=True,
)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sns.barplot(data=selection_rate, x="treated", y="selected_rate", color="#60a5fa", ax=axes[0])
axes[0].set_title("Selection Rate By Treatment")
axes[0].set_xlabel("Treatment D")
axes[0].set_ylabel("P(S=1)")
axes[0].set_ylim(0, 1)

sns.kdeplot(
    data=plot_outcomes,
    x="outcome_for_plot",
    hue="outcome_source",
    common_norm=False,
    linewidth=2,
    ax=axes[1],
)
axes[1].set_title("Full Outcome Versus Selected Observed Outcome")
axes[1].set_xlabel("Outcome")
axes[1].set_ylabel("Density")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_mar_selection_and_outcome_visibility.png", dpi=160, bbox_inches="tight")
plt.show()

display(selection_rate)

treated selected_rate n
0 0 0.422383 3047
1 1 0.617338 2953

Treatment affects selection, and selected rows have a shifted outcome distribution. A selected-row-only analysis is therefore risky because it changes the population being analyzed.

Covariate Balance And Selection Balance

Selection can distort the covariate distribution. The table below compares selected and unselected rows on observed features. Large differences do not make correction impossible, but they make the selection model more important.

balance_rows = []
for col in feature_cols_mar:
    selected_mean = mar_df.loc[mar_df["selected"] == 1, col].mean()
    unselected_mean = mar_df.loc[mar_df["selected"] == 0, col].mean()
    pooled_sd = mar_df[col].std(ddof=1)
    balance_rows.append(
        {
            "feature": col,
            "selected_mean": selected_mean,
            "unselected_mean": unselected_mean,
            "difference": selected_mean - unselected_mean,
            "standardized_difference": (selected_mean - unselected_mean) / pooled_sd,
        }
    )

selection_balance = pd.DataFrame(balance_rows).sort_values("standardized_difference", key=np.abs, ascending=False)
save_table(selection_balance, f"{NOTEBOOK_PREFIX}_mar_selection_covariate_balance.csv")
display(selection_balance)
feature selected_mean unselected_mean difference standardized_difference
5 selection_encouragement 0.279948 -0.288673 0.568620 0.567095
2 support_need -0.208256 0.262695 -0.470952 -0.470948
0 engagement_score 0.198322 -0.230804 0.429127 0.427795
3 prior_usage 0.109263 -0.048825 0.158088 0.158684
1 baseline_value -0.010164 -0.020088 0.009924 0.009853
4 mobile_user 0.483280 0.479585 0.003695 0.007394

The selected sample is not a random slice of the full sample. The encouragement and support-need variables are especially important for modeling selection.

Naive Baselines

The first baseline uses only selected rows and compares treated versus control outcomes. The second baseline uses the synthetic full outcome and is shown only for teaching. In real selected-outcome data, the second baseline cannot be computed.

selected_df = mar_df.loc[mar_df["selected"] == 1].copy()

naive_selected_diff = (
    selected_df.loc[selected_df["treated"] == 1, "outcome_observed"].mean()
    - selected_df.loc[selected_df["treated"] == 0, "outcome_observed"].mean()
)
full_population_diff = (
    mar_df.loc[mar_df["treated"] == 1, "full_outcome"].mean()
    - mar_df.loc[mar_df["treated"] == 0, "full_outcome"].mean()
)

selected_ols = smf.ols(
    formula="outcome_observed ~ treated + " + " + ".join(feature_cols_mar),
    data=selected_df,
).fit(cov_type="HC3")

baseline_summary = pd.DataFrame(
    [
        {
            "estimator": "Selected-row treated-control difference",
            "theta_hat": naive_selected_diff,
            "std_error": np.nan,
            "ci_95_lower": np.nan,
            "ci_95_upper": np.nan,
            "true_target": TRUE_MAR_ATE,
            "bias_vs_target": naive_selected_diff - TRUE_MAR_ATE,
        },
        {
            "estimator": "Synthetic full-population treated-control difference",
            "theta_hat": full_population_diff,
            "std_error": np.nan,
            "ci_95_lower": np.nan,
            "ci_95_upper": np.nan,
            "true_target": TRUE_MAR_ATE,
            "bias_vs_target": full_population_diff - TRUE_MAR_ATE,
        },
        {
            "estimator": "Selected-row OLS with covariates",
            "theta_hat": float(selected_ols.params["treated"]),
            "std_error": float(selected_ols.bse["treated"]),
            "ci_95_lower": float(selected_ols.conf_int().loc["treated", 0]),
            "ci_95_upper": float(selected_ols.conf_int().loc["treated", 1]),
            "true_target": TRUE_MAR_ATE,
            "bias_vs_target": float(selected_ols.params["treated"] - TRUE_MAR_ATE),
        },
    ]
)
save_table(baseline_summary, f"{NOTEBOOK_PREFIX}_mar_baseline_estimates.csv")
display(baseline_summary)
estimator theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 Selected-row treated-control difference 1.675867 NaN NaN NaN 1.032326 0.643540
1 Synthetic full-population treated-control diff... 1.732946 NaN NaN NaN 1.032326 0.700620
2 Selected-row OLS with covariates 1.065852 0.026952 1.013027 1.118678 1.032326 0.033526

Both unadjusted differences are far from the true ATE: treatment assignment is confounded by the covariates, and the selected-row version additionally mixes in the outcome-observation process. Covariate adjustment helps, but it still does not explicitly model the missing-outcome process for the full target population.

Manual Cross-Fitted MAR Score

Before fitting DoubleMLSSM, we compute a small manual missing-at-random score. This makes the mechanics visible.

The code trains:

  • g(1, X) on selected treated rows;
  • g(0, X) on selected control rows;
  • pi(D, X) on all rows to predict selection;
  • m(X) on all rows to predict treatment.

Then it combines residualized observed outcomes with outcome-model predictions.

def crossfit_mar_ssm(df, y_col, d_col, s_col, x_cols, ml_g, ml_pi, ml_m, n_splits=5, seed=RANDOM_SEED):
    X = df[x_cols].reset_index(drop=True)
    y = df[y_col].to_numpy()
    d = df[d_col].to_numpy()
    s = df[s_col].to_numpy()
    dx = pd.concat([X, pd.Series(d, name=d_col)], axis=1)
    n = len(df)

    g1_hat = np.full(n, np.nan)
    g0_hat = np.full(n, np.nan)
    pi_hat = np.full(n, np.nan)
    m_hat = np.full(n, np.nan)

    strata = d + 2 * s
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for train_idx, test_idx in splitter.split(X, strata):
        train_selected_treated = train_idx[(s[train_idx] == 1) & (d[train_idx] == 1)]
        train_selected_control = train_idx[(s[train_idx] == 1) & (d[train_idx] == 0)]

        g1_model = clone(ml_g)
        g0_model = clone(ml_g)
        pi_model = clone(ml_pi)
        m_model = clone(ml_m)

        g1_model.fit(X.iloc[train_selected_treated], y[train_selected_treated])
        g0_model.fit(X.iloc[train_selected_control], y[train_selected_control])
        pi_model.fit(dx.iloc[train_idx], s[train_idx])
        m_model.fit(X.iloc[train_idx], d[train_idx])

        g1_hat[test_idx] = g1_model.predict(X.iloc[test_idx])
        g0_hat[test_idx] = g0_model.predict(X.iloc[test_idx])
        pi_hat[test_idx] = predict_probability(pi_model, dx.iloc[test_idx])
        m_hat[test_idx] = predict_probability(m_model, X.iloc[test_idx])

    pi_hat = clip_probability(pi_hat)
    m_hat = clip_probability(m_hat)

    treated_contribution = d * s * (y - g1_hat) / (m_hat * pi_hat) + g1_hat
    control_contribution = (1 - d) * s * (y - g0_hat) / ((1 - m_hat) * pi_hat) + g0_hat
    score_contribution = treated_contribution - control_contribution

    theta_hat = score_contribution.mean()
    std_error = score_contribution.std(ddof=1) / np.sqrt(n)

    summary = pd.DataFrame(
        [
            {
                "estimator": "Manual cross-fitted MAR SSM",
                "theta_hat": theta_hat,
                "std_error": std_error,
                "ci_95_lower": theta_hat - 1.96 * std_error,
                "ci_95_upper": theta_hat + 1.96 * std_error,
                "true_target": TRUE_MAR_ATE,
                "bias_vs_target": theta_hat - TRUE_MAR_ATE,
            }
        ]
    )
    diagnostics = pd.DataFrame(
        {
            "g1_hat": g1_hat,
            "g0_hat": g0_hat,
            "pi_hat": pi_hat,
            "m_hat": m_hat,
            "score_contribution": score_contribution,
            "treated": d,
            "selected": s,
        }
    )
    return summary, diagnostics

linear_g, linear_pi, linear_m = make_linear_learners()
manual_mar_summary, manual_mar_diagnostics = crossfit_mar_ssm(
    mar_df,
    y_col="outcome_observed",
    d_col="treated",
    s_col="selected",
    x_cols=feature_cols_mar,
    ml_g=linear_g,
    ml_pi=linear_pi,
    ml_m=linear_m,
    n_splits=5,
    seed=RANDOM_SEED,
)
save_table(manual_mar_summary, f"{NOTEBOOK_PREFIX}_manual_mar_ssm_summary.csv")
save_table(manual_mar_diagnostics.describe().reset_index(), f"{NOTEBOOK_PREFIX}_manual_mar_ssm_diagnostics.csv")
display(manual_mar_summary)
estimator theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 Manual cross-fitted MAR SSM 1.003513 0.030544 0.943647 1.06338 1.032326 -0.028813

The manual score is much closer to the true ATE than the selected-row difference. That is the core lesson: selection correction uses unselected rows through the selection and treatment models, even though their outcomes are not observed.

Manual Score Diagnostics

The manual diagnostics check whether treatment and selection probabilities are extreme. The score contribution plot shows whether a few rows dominate the estimate.

fig, axes = plt.subplots(1, 3, figsize=(15, 4.8))

sns.histplot(
    data=manual_mar_diagnostics,
    x="m_hat",
    hue="treated",
    bins=40,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[0],
)
axes[0].set_title("Treatment Propensity")
axes[0].set_xlabel("m_hat(X)")
axes[0].set_ylabel("Density")

sns.histplot(
    data=manual_mar_diagnostics,
    x="pi_hat",
    hue="selected",
    bins=40,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[1],
)
axes[1].set_title("Selection Probability")
axes[1].set_xlabel("pi_hat(D, X)")
axes[1].set_ylabel("Density")

sns.histplot(
    data=manual_mar_diagnostics,
    x="score_contribution",
    bins=70,
    color="#2563eb",
    ax=axes[2],
)
axes[2].axvline(manual_mar_summary["theta_hat"].iloc[0], color="#b91c1c", linestyle="--", label="Mean score")
axes[2].set_title("Score Contributions")
axes[2].set_xlabel("Contribution")
axes[2].set_ylabel("Count")
axes[2].legend()

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_manual_mar_score_diagnostics.png", dpi=160, bbox_inches="tight")
plt.show()

The treatment and selection predictions are mostly away from 0 and 1. That matters because inverse-probability terms become unstable when predicted probabilities are too small.
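A quick calculation shows how fast the weights grow. The combined inverse weight on a selected treated row is 1 / (m(X) * pi(1, X)), so two probabilities of 0.05 already produce a weight of 400. The values below are made up for illustration.

# Inverse-weight blow-up for small probabilities (illustrative values only).
probability_grid = np.array([0.50, 0.20, 0.10, 0.05])
for p in probability_grid:
    weight = 1.0 / (p * p)  # m_hat and pi_hat both equal to p
    print(f"m_hat = pi_hat = {p:.2f} -> inverse weight {weight:,.0f}")
# A handful of such rows can dominate the score average, which is why
# clip_probability keeps estimated probabilities away from 0 and 1.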

DoubleML Sample-Selection Backend

DoubleMLSSMData is the data contract for sample-selection models. The selection column is passed as s_col. Unselected rows stay in the data because they are needed for the treatment and selection nuisance models.

mar_dml_data = DoubleMLSSMData(
    mar_df,
    y_col="outcome_observed",
    d_cols="treated",
    s_col="selected",
    x_cols=feature_cols_mar,
)

mar_backend_summary = pd.DataFrame(
    [
        {"property": "n_obs", "value": mar_dml_data.n_obs},
        {"property": "outcome", "value": mar_dml_data.y_col},
        {"property": "treatment", "value": ", ".join(mar_dml_data.d_cols)},
        {"property": "selection", "value": mar_dml_data.s_col},
        {"property": "n_covariates", "value": len(mar_dml_data.x_cols)},
        {"property": "covariates", "value": ", ".join(mar_dml_data.x_cols)},
    ]
)
save_table(mar_backend_summary, f"{NOTEBOOK_PREFIX}_mar_doublemlssm_backend.csv")
display(mar_backend_summary)
property value
0 n_obs 6000
1 outcome outcome_observed
2 treatment treated
3 selection selected
4 n_covariates 6
5 covariates engagement_score, baseline_value, support_need...

This is the most important API cell in the notebook. If s_col is missing, the model no longer knows which outcomes are observed and the selected-outcome design is lost.
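A few cheap assertions can guard this contract before any model is fit. The checks below are a defensive sketch written for this notebook, not part of the DoubleML API.

# Defensive checks on the selected-outcome data contract (notebook sketch, not DoubleML API).
assert mar_dml_data.n_obs == len(mar_df), "unselected rows must remain in the data"
assert "selected" not in mar_dml_data.x_cols, "the selection indicator is not an ordinary covariate"
assert set(mar_df["selected"].unique()) <= {0, 1}, "the selection indicator must be binary"
assert mar_df["outcome_observed"].notna().all(), "the outcome column must be numeric for every row"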

Fit DoubleMLSSM Under Missing At Random

We fit three versions:

  • linear nuisance learners;
  • gradient boosting nuisance learners;
  • gradient boosting with normalized inverse-probability weights.

The normalized version can be helpful when weights are noisy, but it is not a magic fix for poor overlap or wrong assumptions.

mar_model_rows = []
mar_models = {}

ml_g_linear, ml_pi_linear, ml_m_linear = make_linear_learners()
mar_ssm_linear = DoubleMLSSM(
    mar_dml_data,
    ml_g=ml_g_linear,
    ml_pi=ml_pi_linear,
    ml_m=ml_m_linear,
    n_folds=5,
    score="missing-at-random",
)
mar_ssm_linear.fit(store_predictions=True)
mar_models["MAR linear"] = mar_ssm_linear
mar_model_rows.append(summarize_doubleml(mar_ssm_linear, "MAR linear", "missing-at-random", TRUE_MAR_ATE))

ml_g_hgb, ml_pi_hgb, ml_m_hgb = make_hgb_learners(seed=RANDOM_SEED + 10, max_iter=90)
mar_ssm_hgb = DoubleMLSSM(
    mar_dml_data,
    ml_g=ml_g_hgb,
    ml_pi=ml_pi_hgb,
    ml_m=ml_m_hgb,
    n_folds=5,
    score="missing-at-random",
)
mar_ssm_hgb.fit(store_predictions=True)
mar_models["MAR HGB"] = mar_ssm_hgb
mar_model_rows.append(summarize_doubleml(mar_ssm_hgb, "MAR HGB", "missing-at-random", TRUE_MAR_ATE))

ml_g_norm, ml_pi_norm, ml_m_norm = make_hgb_learners(seed=RANDOM_SEED + 20, max_iter=90)
mar_ssm_hgb_norm = DoubleMLSSM(
    mar_dml_data,
    ml_g=ml_g_norm,
    ml_pi=ml_pi_norm,
    ml_m=ml_m_norm,
    n_folds=5,
    score="missing-at-random",
    normalize_ipw=True,
)
mar_ssm_hgb_norm.fit(store_predictions=True)
mar_models["MAR HGB normalized IPW"] = mar_ssm_hgb_norm
mar_model_rows.append(summarize_doubleml(mar_ssm_hgb_norm, "MAR HGB normalized IPW", "missing-at-random", TRUE_MAR_ATE))

mar_doubleml_summary = pd.DataFrame(mar_model_rows)
save_table(mar_doubleml_summary, f"{NOTEBOOK_PREFIX}_mar_doublemlssm_summary.csv")
display(mar_doubleml_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 MAR linear missing-at-random 1.001539 0.030564 0.941634 1.061444 1.032326 -0.030787
1 MAR HGB missing-at-random 1.002785 0.036305 0.931628 1.073943 1.032326 -0.029541
2 MAR HGB normalized IPW missing-at-random 1.023998 0.016952 0.990773 1.057222 1.032326 -0.008329

The DoubleML estimates are close to the true full-population ATE. The selected-row difference was much larger, which shows why the selection problem matters.

Estimate Comparison

The comparison plot puts the naive baselines, manual score, and DoubleML estimates on one axis. This is often the clearest way to communicate what selection correction changed.

mar_estimates = pd.concat(
    [
        baseline_summary,
        manual_mar_summary,
        mar_doubleml_summary,
    ],
    ignore_index=True,
    sort=False,
)

# Add simple intervals for rows without model-based intervals when available.
selected_diff_se = np.sqrt(
    selected_df.loc[selected_df["treated"] == 1, "outcome_observed"].var(ddof=1) / (selected_df["treated"] == 1).sum()
    + selected_df.loc[selected_df["treated"] == 0, "outcome_observed"].var(ddof=1) / (selected_df["treated"] == 0).sum()
)
mar_estimates.loc[mar_estimates["estimator"] == "Selected-row treated-control difference", "std_error"] = selected_diff_se
mar_estimates.loc[mar_estimates["estimator"] == "Selected-row treated-control difference", "ci_95_lower"] = naive_selected_diff - 1.96 * selected_diff_se
mar_estimates.loc[mar_estimates["estimator"] == "Selected-row treated-control difference", "ci_95_upper"] = naive_selected_diff + 1.96 * selected_diff_se

full_diff_se = np.sqrt(
    mar_df.loc[mar_df["treated"] == 1, "full_outcome"].var(ddof=1) / (mar_df["treated"] == 1).sum()
    + mar_df.loc[mar_df["treated"] == 0, "full_outcome"].var(ddof=1) / (mar_df["treated"] == 0).sum()
)
mar_estimates.loc[mar_estimates["estimator"] == "Synthetic full-population treated-control difference", "std_error"] = full_diff_se
mar_estimates.loc[mar_estimates["estimator"] == "Synthetic full-population treated-control difference", "ci_95_lower"] = full_population_diff - 1.96 * full_diff_se
mar_estimates.loc[mar_estimates["estimator"] == "Synthetic full-population treated-control difference", "ci_95_upper"] = full_population_diff + 1.96 * full_diff_se

save_table(mar_estimates, f"{NOTEBOOK_PREFIX}_mar_estimate_comparison.csv")
plot_mar_estimates = add_ci_columns(mar_estimates)

fig, ax = plt.subplots(figsize=(11, 6.5))
order = plot_mar_estimates.sort_values("theta_hat", ascending=False)["estimator"]
sns.pointplot(
    data=plot_mar_estimates,
    y="estimator",
    x="theta_hat",
    order=order,
    linestyle="none",
    color="#111827",
    errorbar=None,
    ax=ax,
)
for y_pos, (_, row) in enumerate(plot_mar_estimates.set_index("estimator").loc[order].reset_index().iterrows()):
    ax.errorbar(
        x=row["theta_hat"],
        y=y_pos,
        xerr=[[row["lower_error"]], [row["upper_error"]]],
        fmt="none",
        color="#475569",
        capsize=3,
        linewidth=1.1,
    )
ax.axvline(TRUE_MAR_ATE, color="#b91c1c", linestyle="--", linewidth=1.4, label="True ATE")
ax.axvline(TRUE_MAR_ATT, color="#2563eb", linestyle="--", linewidth=1.4, label="True ATT")
ax.set_title("Missing-At-Random Selection Estimate Comparison")
ax.set_xlabel("Estimated Effect")
ax.set_ylabel("")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_mar_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

display(mar_estimates)

estimator theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target design
0 Selected-row treated-control difference 1.675867 0.046957 1.583831 1.767903 1.032326 0.643540 NaN
1 Synthetic full-population treated-control diff... 1.732946 0.034447 1.665431 1.800462 1.032326 0.700620 NaN
2 Selected-row OLS with covariates 1.065852 0.026952 1.013027 1.118678 1.032326 0.033526 NaN
3 Manual cross-fitted MAR SSM 1.003513 0.030544 0.943647 1.063380 1.032326 -0.028813 NaN
4 MAR linear 1.001539 0.030564 0.941634 1.061444 1.032326 -0.030787 missing-at-random
5 MAR HGB 1.002785 0.036305 0.931628 1.073943 1.032326 -0.029541 missing-at-random
6 MAR HGB normalized IPW 1.023998 0.016952 0.990773 1.057222 1.032326 -0.008329 missing-at-random

The corrected estimates move toward the full-population ATE. The plot also shows that the ATT is a different target; selected-outcome correction does not remove the need to state the estimand.

Nuisance Learner Diagnostics

Now we inspect the nuisance learners. These diagnostics are not a proof of selection validity. They are checks that the prediction components are not obviously broken.

mar_loss_tables = []
for label, model in mar_models.items():
    mar_loss_tables.append(learner_loss_table(model, label))

mar_nuisance_losses = pd.concat(mar_loss_tables, ignore_index=True)
save_table(mar_nuisance_losses, f"{NOTEBOOK_PREFIX}_mar_nuisance_losses.csv")
display(mar_nuisance_losses)
model learner metric_value metric
0 MAR linear ml_g_d0 2.238000 RMSE for outcome learners; classification erro...
1 MAR linear ml_g_d1 2.837962 RMSE for outcome learners; classification erro...
2 MAR linear ml_pi 0.446393 RMSE for outcome learners; classification erro...
3 MAR linear ml_m 0.476925 RMSE for outcome learners; classification erro...
4 MAR HGB ml_g_d0 2.252895 RMSE for outcome learners; classification erro...
5 MAR HGB ml_g_d1 2.862023 RMSE for outcome learners; classification erro...
6 MAR HGB ml_pi 0.448827 RMSE for outcome learners; classification erro...
7 MAR HGB ml_m 0.480633 RMSE for outcome learners; classification erro...
8 MAR HGB normalized IPW ml_g_d0 2.245353 RMSE for outcome learners; classification erro...
9 MAR HGB normalized IPW ml_g_d1 2.863304 RMSE for outcome learners; classification erro...
10 MAR HGB normalized IPW ml_pi 0.449762 RMSE for outcome learners; classification erro...
11 MAR HGB normalized IPW ml_m 0.482784 RMSE for outcome learners; classification erro...

The outcome nuisance losses are measured on the outcome scale. The propensity learner losses summarize classification-style performance. If these losses were very poor, the final estimate would deserve more skepticism.

Prediction Quality Against Synthetic Truth

Because the data are synthetic, we can compare the HGB nuisance predictions to true treatment and selection probabilities. In real data, this truth is unavailable, but the same plots can still reveal extreme probabilities.

mar_hgb_predictions = pd.DataFrame(
    {
        "treated": mar_df["treated"],
        "selected": mar_df["selected"],
        "true_treatment_propensity": mar_df["true_treatment_propensity"],
        "true_selection_probability": mar_df["true_selection_probability"],
        "m_hat": prediction_vector(mar_ssm_hgb, "ml_m"),
        "pi_hat": prediction_vector(mar_ssm_hgb, "ml_pi"),
        "g0_hat": prediction_vector(mar_ssm_hgb, "ml_g_d0"),
        "g1_hat": prediction_vector(mar_ssm_hgb, "ml_g_d1"),
        "full_outcome": mar_df["full_outcome"],
        "true_tau": mar_df["true_tau"],
    }
)

mar_prediction_quality = pd.DataFrame(
    [
        {
            "quantity": "treatment propensity m_hat",
            "rmse_vs_truth": rmse(mar_hgb_predictions["true_treatment_propensity"], mar_hgb_predictions["m_hat"]),
            "mae_vs_truth": mean_absolute_error(mar_hgb_predictions["true_treatment_propensity"], mar_hgb_predictions["m_hat"]),
        },
        {
            "quantity": "selection probability pi_hat",
            "rmse_vs_truth": rmse(mar_hgb_predictions["true_selection_probability"], mar_hgb_predictions["pi_hat"]),
            "mae_vs_truth": mean_absolute_error(mar_hgb_predictions["true_selection_probability"], mar_hgb_predictions["pi_hat"]),
        },
    ]
)
save_table(mar_prediction_quality, f"{NOTEBOOK_PREFIX}_mar_hgb_prediction_quality.csv")
display(mar_prediction_quality)
quantity rmse_vs_truth mae_vs_truth
0 treatment propensity m_hat 0.072886 0.057121
1 selection probability pi_hat 0.073758 0.056554

The probability predictions track the synthetic truth reasonably well. This does not guarantee a causal estimate, but it confirms that the nuisance models are learning the intended observed structure.
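The setup cell also imported roc_auc_score and log_loss, and they are usable here without the synthetic truth: discrimination and calibration scores need only the observed treatment and selection labels, so the same check carries over to real data. A small example:

# Label-only checks for the probability learners; these do not require synthetic truth.
propensity_checks = pd.DataFrame(
    [
        {
            "quantity": "treatment propensity m_hat",
            "auc": roc_auc_score(mar_hgb_predictions["treated"], mar_hgb_predictions["m_hat"]),
            "log_loss": log_loss(mar_hgb_predictions["treated"], mar_hgb_predictions["m_hat"]),
        },
        {
            "quantity": "selection probability pi_hat",
            "auc": roc_auc_score(mar_hgb_predictions["selected"], mar_hgb_predictions["pi_hat"]),
            "log_loss": log_loss(mar_hgb_predictions["selected"], mar_hgb_predictions["pi_hat"]),
        },
    ]
)
display(propensity_checks)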

Treatment And Selection Overlap

The next plot is one of the most important diagnostics in a sample-selection analysis. Both the treatment propensity and the selection probability appear in denominators. Near-zero probabilities can create unstable weights.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.histplot(
    data=mar_hgb_predictions,
    x="m_hat",
    hue="treated",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[0],
)
axes[0].set_title("Estimated Treatment Propensity")
axes[0].set_xlabel("m_hat(X)")
axes[0].set_ylabel("Density")

sns.histplot(
    data=mar_hgb_predictions,
    x="pi_hat",
    hue="selected",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[1],
)
axes[1].set_title("Estimated Selection Probability")
axes[1].set_xlabel("pi_hat(D, X)")
axes[1].set_ylabel("Density")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_mar_overlap_diagnostics.png", dpi=160, bbox_inches="tight")
plt.show()

The predicted probabilities have usable overlap and are not concentrated at the clipping boundaries. That is a good sign for this synthetic example.

Bootstrap Confidence Interval

The multiplier bootstrap gives another uncertainty view for the preferred MAR model. It quantifies sampling uncertainty under the maintained design assumptions.

mar_ssm_hgb.bootstrap(method="normal", n_rep_boot=500)
# joint=True makes confint use the stored bootstrap draws; without it, confint returns the plain asymptotic interval.
mar_bootstrap_ci = mar_ssm_hgb.confint(level=0.95, joint=True).reset_index().rename(columns={"index": "treatment"})
mar_bootstrap_ci["estimator"] = "MAR HGB bootstrap CI"
save_table(mar_bootstrap_ci, f"{NOTEBOOK_PREFIX}_mar_hgb_bootstrap_ci.csv")
display(mar_bootstrap_ci)
treatment 2.5 % 97.5 % estimator
0 treated 0.931628 1.073943 MAR HGB bootstrap CI

The bootstrap interval should be reported alongside the point estimate, but it does not account for misspecified selection assumptions. Selection bias is a design concern, not just a sampling concern.

Selection Positivity Stress Test

This stress test shifts the selection equation so outcomes become easier or harder to observe. Lower selection rates usually increase uncertainty because fewer outcomes are available for the outcome regressions and inverse-selection correction.

stress_rows = []
for selection_shift in [-0.70, -0.35, 0.00, 0.35]:
    stress_probability = sigmoid(
        selection_shift
        - 0.25
        + 0.55 * mar_df["treated"]
        + 0.45 * mar_df["engagement_score"]
        - 0.55 * mar_df["support_need"]
        + 0.25 * mar_df["prior_usage"]
        + 0.70 * mar_df["selection_encouragement"]
    )
    stress_selected = rng.binomial(1, stress_probability)
    stress_df = mar_df.copy()
    stress_df["selected_stress"] = stress_selected
    stress_df["outcome_stress"] = stress_df["full_outcome"]
    stress_df.loc[stress_df["selected_stress"] == 0, "outcome_stress"] = 0.0

    stress_data = DoubleMLSSMData(
        stress_df,
        y_col="outcome_stress",
        d_cols="treated",
        s_col="selected_stress",
        x_cols=feature_cols_mar,
    )
    ml_g_stress, ml_pi_stress, ml_m_stress = make_linear_learners()
    stress_model = DoubleMLSSM(
        stress_data,
        ml_g=ml_g_stress,
        ml_pi=ml_pi_stress,
        ml_m=ml_m_stress,
        n_folds=5,
        score="missing-at-random",
    )
    stress_model.fit(store_predictions=False)
    stress_rows.append(
        {
            "selection_shift": selection_shift,
            "selected_share": stress_selected.mean(),
            "theta_hat": float(stress_model.coef[0]),
            "std_error": float(stress_model.se[0]),
            "true_target": TRUE_MAR_ATE,
            "bias_vs_target": float(stress_model.coef[0] - TRUE_MAR_ATE),
        }
    )

selection_stress = pd.DataFrame(stress_rows)
save_table(selection_stress, f"{NOTEBOOK_PREFIX}_selection_positivity_stress.csv")

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sns.lineplot(data=selection_stress, x="selected_share", y="theta_hat", marker="o", ax=axes[0])
axes[0].axhline(TRUE_MAR_ATE, color="#b91c1c", linestyle="--", label="True ATE")
axes[0].set_title("Estimate As Outcome Visibility Changes")
axes[0].set_xlabel("Selected Share")
axes[0].set_ylabel("Estimated ATE")
axes[0].legend()

sns.lineplot(data=selection_stress, x="selected_share", y="std_error", marker="o", color="#b45309", ax=axes[1])
axes[1].set_title("Uncertainty As Outcome Visibility Changes")
axes[1].set_xlabel("Selected Share")
axes[1].set_ylabel("Standard Error")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_selection_positivity_stress.png", dpi=160, bbox_inches="tight")
plt.show()

display(selection_stress)

selection_shift selected_share theta_hat std_error true_target bias_vs_target
0 -0.70 0.368833 0.986534 0.036497 1.032326 -0.045792
1 -0.35 0.432000 1.016098 0.035148 1.032326 -0.016229
2 0.00 0.510167 1.049771 0.030484 1.032326 0.017444
3 0.35 0.577000 1.027837 0.027579 1.032326 -0.004490

The standard error tends to rise when fewer outcomes are observed. This is the practical cost of selected outcomes: even when the assumptions are correct, less outcome visibility means less information.

Nonignorable Selection Data

The next dataset adds a hidden response factor U. This hidden factor affects both outcome observation and the outcome itself. A missing-at-random analysis is now misspecified because conditioning on observed covariates is not enough.

We also add selection_encouragement, a variable that affects selection but does not directly affect the outcome. In the synthetic design, this variable satisfies the exclusion restriction. In real work, that restriction would require a domain argument.

rng_non = np.random.default_rng(RANDOM_SEED + 100)
N_NON = 6_000

non_engagement = rng_non.normal(0, 1, N_NON)
non_baseline = rng_non.normal(0, 1, N_NON)
non_support = rng_non.normal(0, 1, N_NON)
non_prior_usage = rng_non.normal(0, 1, N_NON)
non_mobile = rng_non.binomial(1, 0.48, N_NON)
non_selection_encouragement = rng_non.normal(0, 1, N_NON)
hidden_response_factor = rng_non.normal(0, 1, N_NON)

non_x = pd.DataFrame(
    {
        "engagement_score": non_engagement,
        "baseline_value": non_baseline,
        "support_need": non_support,
        "prior_usage": non_prior_usage,
        "mobile_user": non_mobile,
    }
)
feature_cols_non = list(non_x.columns)

non_treatment_propensity = sigmoid(-0.10 + 0.55 * non_engagement + 0.35 * non_baseline - 0.30 * non_support + 0.18 * non_mobile)
non_treated = rng_non.binomial(1, non_treatment_propensity)
non_tau = 1.00 + 0.25 * non_engagement + 0.15 * non_baseline - 0.10 * non_support + 0.08 * non_mobile
non_mu0 = 3.00 + 0.80 * non_engagement + 0.50 * non_baseline - 0.30 * non_support + 0.25 * non_prior_usage + 0.20 * non_mobile
non_full_outcome = non_mu0 + non_tau * non_treated + 0.60 * hidden_response_factor + rng_non.normal(0, 0.55, N_NON)

non_selection_probability = sigmoid(
    -0.25
    + 0.55 * non_treated
    + 0.45 * non_engagement
    - 0.55 * non_support
    + 0.25 * non_prior_usage
    + 0.85 * non_selection_encouragement
    + 0.75 * hidden_response_factor
)
non_selected = rng_non.binomial(1, non_selection_probability)
non_outcome_observed = non_full_outcome.copy()
non_outcome_observed[non_selected == 0] = 0.0

non_df = pd.concat(
    [
        pd.DataFrame({"row_id": np.arange(N_NON)}),
        non_x,
        pd.DataFrame(
            {
                "selection_encouragement": non_selection_encouragement,
                "treated": non_treated,
                "selected": non_selected,
                "outcome_observed": non_outcome_observed,
                "full_outcome": non_full_outcome,
                "true_tau": non_tau,
                "true_treatment_propensity": non_treatment_propensity,
                "true_selection_probability": non_selection_probability,
                "hidden_response_factor": hidden_response_factor,
            }
        ),
    ],
    axis=1,
)

TRUE_NON_ATE = non_df["true_tau"].mean()
non_path = DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_nonignorable_selection_data.csv"
non_df.to_csv(non_path, index=False)

print(f"Rows: {len(non_df):,}")
print(f"Selected share: {non_df['selected'].mean():.3f}")
print(f"True nonignorable-design ATE: {TRUE_NON_ATE:.4f}")
print(f"Saved data to: {non_path.relative_to(PROJECT_ROOT)}")
display(non_df.head())
Rows: 6,000
Selected share: 0.509
True nonignorable-design ATE: 1.0389
Saved data to: notebooks/tutorials/doubleml/outputs/datasets/08_synthetic_nonignorable_selection_data.csv
row_id engagement_score baseline_value support_need prior_usage mobile_user selection_encouragement treated selected outcome_observed full_outcome true_tau true_treatment_propensity true_selection_probability hidden_response_factor
0 0 -0.311445 -0.577787 -0.069680 0.874368 0 -1.616559 0 0 0.000000 1.737582 0.842439 0.388739 0.043021 -2.126387
1 1 -2.245198 0.178591 0.631997 -1.302306 1 -0.500702 0 0 0.000000 1.281574 0.482289 0.217218 0.075842 -0.188155
2 2 -0.214792 -0.104288 0.117372 -0.859742 0 0.019919 1 1 4.283298 4.283298 0.918922 0.428041 0.740645 1.478050
3 3 0.886864 0.208647 0.431936 -0.321071 1 -0.362617 1 1 5.454790 5.454790 1.289819 0.625093 0.550140 0.170938
4 4 0.174338 -0.218863 -0.605183 -0.134723 1 1.157343 1 0 0.000000 4.405956 1.151273 0.569752 0.851009 0.108234

This design deliberately violates missing-at-random selection. The hidden factor affects both selection and the outcome. The selection encouragement is available as an instrument-like variable for the nonignorable score.
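One practical detail is worth verifying: the placeholder written into outcome_observed for unselected rows should be irrelevant, because the outcome model g is fit among selected rows and Y appears in the missing-at-random score only inside residual terms multiplied by S. The sketch below refits a MAR-score model with a different placeholder and compares estimates; they should agree up to cross-fitting randomness. It reuses make_linear_learners and the DoubleML classes from earlier cells.

# Sketch: the fill value for unobserved outcomes should not move the estimate,
# since unselected Y values never enter the outcome model or the score.
placeholder_df = non_df.copy()
placeholder_df.loc[placeholder_df["selected"] == 0, "outcome_observed"] = -999.0

placeholder_estimates = []
for frame in (non_df, placeholder_df):
    data = DoubleMLSSMData(
        frame,
        y_col="outcome_observed",
        d_cols="treated",
        s_col="selected",
        x_cols=feature_cols_non,
    )
    g, pi, m = make_linear_learners()
    model = DoubleMLSSM(data, ml_g=g, ml_pi=pi, ml_m=m, n_folds=5, score="missing-at-random")
    model.fit()  # any gap between the two runs reflects random fold draws only
    placeholder_estimates.append(float(model.coef[0]))

print(f"Placeholder 0.0:  {placeholder_estimates[0]:.4f}")
print(f"Placeholder -999: {placeholder_estimates[1]:.4f}")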

Nonignorable Design Checks

Before fitting the nonignorable model, check that the selection encouragement is relevant for selection and not mechanically related to treatment. In the synthetic data we know by construction that it has no direct outcome effect; applied work would need a substantive argument for that exclusion.

non_checks = pd.DataFrame(
    [
        {"metric": "selected_share", "value": non_df["selected"].mean()},
        {"metric": "treated_share", "value": non_df["treated"].mean()},
        {"metric": "corr_selection_encouragement_selected", "value": non_df["selection_encouragement"].corr(non_df["selected"])},
        {"metric": "corr_selection_encouragement_treated", "value": non_df["selection_encouragement"].corr(non_df["treated"])},
        {"metric": "corr_hidden_factor_selected", "value": non_df["hidden_response_factor"].corr(non_df["selected"])},
        {"metric": "corr_hidden_factor_full_outcome", "value": non_df["hidden_response_factor"].corr(non_df["full_outcome"])},
    ]
)
save_table(non_checks, f"{NOTEBOOK_PREFIX}_nonignorable_design_checks.csv")
display(non_checks)
metric value
0 selected_share 0.509000
1 treated_share 0.512667
2 corr_selection_encouragement_selected 0.300239
3 corr_selection_encouragement_treated 0.018788
4 corr_hidden_factor_selected 0.269843
5 corr_hidden_factor_full_outcome 0.372271

The selection encouragement is clearly relevant for selection. The hidden factor is also correlated with both selection and the outcome, which is exactly the dependence a missing-at-random analysis cannot absorb.
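A marginal correlation can overstate or understate relevance once other drivers of selection are in the model. A slightly sharper screen is a first-stage-style logistic regression of selection on covariates, treatment, and the encouragement; a minimal sketch (sklearn's default ridge penalty is acceptable for a relevance screen, though not for inference):

from sklearn.linear_model import LogisticRegression

# First-stage-style relevance screen: does the encouragement still predict
# selection after controlling for treatment and covariates? A near-zero
# coefficient would warn of a weak selection instrument.
first_stage_X = non_df[feature_cols_non + ["treated", "selection_encouragement"]]
first_stage = LogisticRegression(max_iter=1000).fit(first_stage_X, non_df["selected"])
print(pd.Series(first_stage.coef_[0], index=first_stage_X.columns).round(3))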

MAR Versus Nonignorable Scores

Now we fit two models to the nonignorable dataset:

- a missing-at-random score, which assumes outcome observation is fully explained by observed treatment and covariates;
- a nonignorable score, which uses selection_encouragement as a selection instrument.

The nonignorable backend passes selection_encouragement through z_cols instead of treating it as an ordinary covariate.

non_mar_features = feature_cols_non + ["selection_encouragement"]
non_mar_data = DoubleMLSSMData(
    non_df,
    y_col="outcome_observed",
    d_cols="treated",
    s_col="selected",
    x_cols=non_mar_features,
)
non_iv_data = DoubleMLSSMData(
    non_df,
    y_col="outcome_observed",
    d_cols="treated",
    s_col="selected",
    x_cols=feature_cols_non,
    z_cols="selection_encouragement",
)

non_rows = []

non_mar_g, non_mar_pi, non_mar_m = make_linear_learners()
wrong_mar_model = DoubleMLSSM(
    non_mar_data,
    ml_g=non_mar_g,
    ml_pi=non_mar_pi,
    ml_m=non_mar_m,
    n_folds=5,
    score="missing-at-random",
)
wrong_mar_model.fit(store_predictions=True)
non_rows.append(summarize_doubleml(wrong_mar_model, "MAR score on nonignorable data", "nonignorable data", TRUE_NON_ATE))

non_g, non_pi, non_m = make_linear_learners()
nonignorable_model = DoubleMLSSM(
    non_iv_data,
    ml_g=non_g,
    ml_pi=non_pi,
    ml_m=non_m,
    n_folds=5,
    score="nonignorable",
)
nonignorable_model.fit(store_predictions=True)
non_rows.append(summarize_doubleml(nonignorable_model, "Nonignorable score with selection instrument", "nonignorable data", TRUE_NON_ATE))

non_hgb_g, non_hgb_pi, non_hgb_m = make_hgb_learners(seed=RANDOM_SEED + 60, max_iter=70)
nonignorable_hgb = DoubleMLSSM(
    non_iv_data,
    ml_g=non_hgb_g,
    ml_pi=non_hgb_pi,
    ml_m=non_hgb_m,
    n_folds=5,
    score="nonignorable",
)
nonignorable_hgb.fit(store_predictions=True)
non_rows.append(summarize_doubleml(nonignorable_hgb, "Nonignorable score HGB", "nonignorable data", TRUE_NON_ATE))

nonignorable_summary = pd.DataFrame(non_rows)
save_table(nonignorable_summary, f"{NOTEBOOK_PREFIX}_nonignorable_score_comparison.csv")
display(nonignorable_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 MAR score on nonignorable data nonignorable data 1.027318 0.035579 0.957586 1.097051 1.038926 -0.011607
1 Nonignorable score with selection instrument nonignorable data 1.070113 0.044343 0.983202 1.157024 1.038926 0.031187
2 Nonignorable score HGB nonignorable data 1.041098 0.046897 0.949182 1.133014 1.038926 0.002172

The nonignorable score uses extra structure to address the hidden selection path. In this particular draw the misspecified MAR score happens to land close to the truth, which a wrong score can do by chance in a single sample. The gain from the nonignorable score is therefore not automatic in every sample, and it depends on the selection encouragement being a credible exclusion variable.
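One way to make that sample-dependence concrete is to refit both scores on a few bootstrap resamples and watch the gap move. A small stability sketch, kept to three replicates for runtime; it reuses the column choices from the data objects above and passes z_cols=None for the MAR case (the constructor default):

# Stability sketch: the MAR-vs-nonignorable ranking can flip across resamples.
# The point is variability, not that either score is uniformly closer.
boot_rows = []
for b in range(3):
    boot_df = non_df.sample(frac=1.0, replace=True, random_state=RANDOM_SEED + b).reset_index(drop=True)
    for score, x_cols, z_cols in [
        ("missing-at-random", non_mar_features, None),
        ("nonignorable", feature_cols_non, "selection_encouragement"),
    ]:
        data = DoubleMLSSMData(
            boot_df,
            y_col="outcome_observed",
            d_cols="treated",
            s_col="selected",
            x_cols=x_cols,
            z_cols=z_cols,
        )
        g, pi, m = make_linear_learners()
        model = DoubleMLSSM(data, ml_g=g, ml_pi=pi, ml_m=m, n_folds=5, score=score)
        model.fit()
        boot_rows.append({"replicate": b, "score": score, "theta_hat": float(model.coef[0])})

display(pd.DataFrame(boot_rows))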

Nonignorable Estimate Plot

The plot below keeps the nonignorable comparison separate from the missing-at-random dataset because the data-generating process is different. The point is not to crown a universal best model; it is to show that the score must match the selection story.

plot_non = add_ci_columns(nonignorable_summary)
fig, ax = plt.subplots(figsize=(10, 5.5))
order = plot_non.sort_values("theta_hat", ascending=False)["estimator"]
sns.pointplot(
    data=plot_non,
    y="estimator",
    x="theta_hat",
    order=order,
    linestyle="none",
    color="#111827",
    errorbar=None,
    ax=ax,
)
ordered_plot = plot_non.set_index("estimator").loc[order].reset_index()
for y_pos, row in ordered_plot.iterrows():
    ax.errorbar(
        x=row["theta_hat"],
        y=y_pos,
        xerr=[[row["lower_error"]], [row["upper_error"]]],
        fmt="none",
        color="#475569",
        capsize=3,
        linewidth=1.1,
    )
ax.axvline(TRUE_NON_ATE, color="#b91c1c", linestyle="--", linewidth=1.4, label="True ATE")
ax.set_title("Nonignorable Selection Score Comparison")
ax.set_xlabel("Estimated Effect")
ax.set_ylabel("")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_nonignorable_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The figure makes the design choice visible. When hidden response behavior matters, a missing-at-random score is answering under the wrong assumption, even if it runs cleanly.

Combined Summary Table

This table collects the main estimates from the notebook. The rows should be compared within their design setting, not as interchangeable models.

combined_ssm_summary = pd.concat(
    [
        baseline_summary.assign(section="MAR data baselines"),
        manual_mar_summary.assign(section="MAR data manual score"),
        mar_doubleml_summary.assign(section="MAR data DoubleML"),
        nonignorable_summary.assign(section="nonignorable data"),
    ],
    ignore_index=True,
    sort=False,
)
save_table(combined_ssm_summary, f"{NOTEBOOK_PREFIX}_combined_ssm_summary.csv")
display(combined_ssm_summary)
estimator theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target section design
0 Selected-row treated-control difference 1.675867 NaN NaN NaN 1.032326 0.643540 MAR data baselines NaN
1 Synthetic full-population treated-control diff... 1.732946 NaN NaN NaN 1.032326 0.700620 MAR data baselines NaN
2 Selected-row OLS with covariates 1.065852 0.026952 1.013027 1.118678 1.032326 0.033526 MAR data baselines NaN
3 Manual cross-fitted MAR SSM 1.003513 0.030544 0.943647 1.063380 1.032326 -0.028813 MAR data manual score NaN
4 MAR linear 1.001539 0.030564 0.941634 1.061444 1.032326 -0.030787 MAR data DoubleML missing-at-random
5 MAR HGB 1.002785 0.036305 0.931628 1.073943 1.032326 -0.029541 MAR data DoubleML missing-at-random
6 MAR HGB normalized IPW 1.023998 0.016952 0.990773 1.057222 1.032326 -0.008329 MAR data DoubleML missing-at-random
7 MAR score on nonignorable data 1.027318 0.035579 0.957586 1.097051 1.038926 -0.011607 nonignorable data nonignorable data
8 Nonignorable score with selection instrument 1.070113 0.044343 0.983202 1.157024 1.038926 0.031187 nonignorable data nonignorable data
9 Nonignorable score HGB 1.041098 0.046897 0.949182 1.133014 1.038926 0.002172 nonignorable data nonignorable data

The combined table shows the notebook’s main story: selected-row-only estimates can be very misleading, missing-at-random correction works when its assumptions match the data, and nonignorable selection requires stronger structure.
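A compact numeric view of that story is absolute bias summarized by section, computed directly from the combined frame:

# Absolute bias by section: the unadjusted baselines stand out, while the
# selection-aware estimators cluster near their respective targets.
bias_by_section = (
    combined_ssm_summary.assign(abs_bias=combined_ssm_summary["bias_vs_target"].abs())
    .groupby("section")["abs_bias"]
    .agg(["mean", "max"])
)
display(bias_by_section)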

Reporting Checklist

A sample-selection report needs more than an effect estimate. The checklist below names the assumptions and diagnostics a reader should see.

ssm_reporting_checklist = pd.DataFrame(
    [
        {"topic": "target population", "question": "Is the effect for all rows or selected rows only?", "notebook_answer": "The main target is the full-population ATE."},
        {"topic": "selection definition", "question": "What does S=1 mean?", "notebook_answer": "The outcome is observed."},
        {"topic": "unselected rows", "question": "Were unselected rows retained?", "notebook_answer": "Yes. They enter treatment and selection nuisance models."},
        {"topic": "MAR assumption", "question": "Can observed D and X explain outcome observation?", "notebook_answer": "True in the MAR synthetic design, false in the nonignorable design."},
        {"topic": "selection overlap", "question": "Are selection probabilities away from zero?", "notebook_answer": "Checked with selection-probability histograms and stress tests."},
        {"topic": "treatment overlap", "question": "Are treatment propensities away from zero and one?", "notebook_answer": "Checked with treatment-propensity histograms."},
        {"topic": "nonignorable selection", "question": "If MAR is not credible, is there a valid selection instrument?", "notebook_answer": "The tutorial uses a synthetic selection encouragement as z_cols."},
        {"topic": "uncertainty", "question": "Are intervals and diagnostics reported?", "notebook_answer": "Yes. Standard errors, bootstrap CI, and nuisance diagnostics are saved."},
    ]
)
save_table(ssm_reporting_checklist, f"{NOTEBOOK_PREFIX}_ssm_reporting_checklist.csv")
display(ssm_reporting_checklist)
topic question notebook_answer
0 target population Is the effect for all rows or selected rows only? The main target is the full-population ATE.
1 selection definition What does S=1 mean? The outcome is observed.
2 unselected rows Were unselected rows retained? Yes. They enter treatment and selection nuisan...
3 MAR assumption Can observed D and X explain outcome observation? True in the MAR synthetic design, false in the...
4 selection overlap Are selection probabilities away from zero? Checked with selection-probability histograms ...
5 treatment overlap Are treatment propensities away from zero and ... Checked with treatment-propensity histograms.
6 nonignorable selection If MAR is not credible, is there a valid selec... The tutorial uses a synthetic selection encour...
7 uncertainty Are intervals and diagnostics reported? Yes. Standard errors, bootstrap CI, and nuisan...

This checklist turns the notebook into a reusable analysis template. The estimate is only one part of the evidence; the selection story is equally important.

Report Template

The final report template writes the preferred MAR result and the nonignorable comparison in plain language. The language is cautious because sample selection assumptions are often the hardest part of the design.

best_mar = mar_doubleml_summary.loc[mar_doubleml_summary["estimator"] == "MAR HGB"].iloc[0]
best_non = nonignorable_summary.loc[nonignorable_summary["estimator"] == "Nonignorable score with selection instrument"].iloc[0]

report_text = f"""
# Sample Selection DoubleML Report Template

## Question

Estimate the treatment effect for the full target population when the outcome is observed only for selected rows.

## Preferred Missing-At-Random Result

The preferred MAR estimate uses `DoubleMLSSM` with gradient boosting nuisance learners.

- Estimate: {best_mar['theta_hat']:.4f}
- Standard error: {best_mar['std_error']:.4f}
- 95 percent CI: [{best_mar['ci_95_lower']:.4f}, {best_mar['ci_95_upper']:.4f}]
- Synthetic true ATE: {TRUE_MAR_ATE:.4f}

## Nonignorable Selection Demonstration

When selection depends on a hidden response factor, a MAR score is not conceptually sufficient. The nonignorable score uses a selection encouragement variable as an exclusion variable.

- Nonignorable estimate: {best_non['theta_hat']:.4f}
- Standard error: {best_non['std_error']:.4f}
- 95 percent CI: [{best_non['ci_95_lower']:.4f}, {best_non['ci_95_upper']:.4f}]
- Synthetic true ATE: {TRUE_NON_ATE:.4f}

## Assumptions To State

- Treatment assignment is unconfounded after conditioning on observed covariates.
- Outcome observation is missing at random after conditioning on treatment and covariates, unless using the nonignorable score.
- Treatment and selection probabilities have adequate overlap.
- For nonignorable selection, the selection encouragement affects outcome observation and has no direct outcome effect.

## Diagnostics To Include

- Selected versus unselected covariate balance.
- Treatment and selection probability overlap.
- Selected-row baseline estimates versus selection-adjusted estimates.
- Nuisance learner diagnostics.
- Stress test showing how lower outcome visibility affects uncertainty.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_ssm_report_template.md"
report_path.write_text(report_text)
print(report_text)
# Sample Selection DoubleML Report Template

## Question

Estimate the treatment effect for the full target population when the outcome is observed only for selected rows.

## Preferred Missing-At-Random Result

The preferred MAR estimate uses `DoubleMLSSM` with gradient boosting nuisance learners.

- Estimate: 1.0028
- Standard error: 0.0363
- 95 percent CI: [0.9316, 1.0739]
- Synthetic true ATE: 1.0323

## Nonignorable Selection Demonstration

When selection depends on a hidden response factor, a MAR score is not conceptually sufficient. The nonignorable score uses a selection encouragement variable as an exclusion variable.

- Nonignorable estimate: 1.0701
- Standard error: 0.0443
- 95 percent CI: [0.9832, 1.1570]
- Synthetic true ATE: 1.0389

## Assumptions To State

- Treatment assignment is unconfounded after conditioning on observed covariates.
- Outcome observation is missing at random after conditioning on treatment and covariates, unless using the nonignorable score.
- Treatment and selection probabilities have adequate overlap.
- For nonignorable selection, the selection encouragement affects outcome observation and has no direct outcome effect.

## Diagnostics To Include

- Selected versus unselected covariate balance.
- Treatment and selection probability overlap.
- Selected-row baseline estimates versus selection-adjusted estimates.
- Nuisance learner diagnostics.
- Stress test showing how lower outcome visibility affects uncertainty.

The report template keeps the estimate tied to its assumptions. That is especially important for selected-outcome data, where the missing outcomes are precisely the part we cannot inspect directly.

Artifact Manifest

The manifest lists every dataset, figure, table, and report produced by this notebook. This makes the tutorial easy to audit and reuse.

artifact_rows = []
for folder, artifact_type in [
    (DATASET_DIR, "dataset"),
    (TABLE_DIR, "table"),
    (FIGURE_DIR, "figure"),
    (REPORT_DIR, "report"),
]:
    for file_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": file_path.name,
                "relative_path": str(file_path.relative_to(PROJECT_ROOT)),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(artifact_manifest)
artifact_type file_name relative_path
0 dataset 08_synthetic_mar_selection_data.csv notebooks/tutorials/doubleml/outputs/datasets/...
1 dataset 08_synthetic_nonignorable_selection_data.csv notebooks/tutorials/doubleml/outputs/datasets/...
2 table 08_artifact_manifest.csv notebooks/tutorials/doubleml/outputs/tables/08...
3 table 08_combined_ssm_summary.csv notebooks/tutorials/doubleml/outputs/tables/08...
4 table 08_manual_mar_ssm_diagnostics.csv notebooks/tutorials/doubleml/outputs/tables/08...
5 table 08_manual_mar_ssm_summary.csv notebooks/tutorials/doubleml/outputs/tables/08...
6 table 08_mar_baseline_estimates.csv notebooks/tutorials/doubleml/outputs/tables/08...
7 table 08_mar_data_audit.csv notebooks/tutorials/doubleml/outputs/tables/08...
8 table 08_mar_doublemlssm_backend.csv notebooks/tutorials/doubleml/outputs/tables/08...
9 table 08_mar_doublemlssm_summary.csv notebooks/tutorials/doubleml/outputs/tables/08...
10 table 08_mar_estimate_comparison.csv notebooks/tutorials/doubleml/outputs/tables/08...
11 table 08_mar_field_dictionary.csv notebooks/tutorials/doubleml/outputs/tables/08...
12 table 08_mar_hgb_bootstrap_ci.csv notebooks/tutorials/doubleml/outputs/tables/08...
13 table 08_mar_hgb_prediction_quality.csv notebooks/tutorials/doubleml/outputs/tables/08...
14 table 08_mar_nuisance_losses.csv notebooks/tutorials/doubleml/outputs/tables/08...
15 table 08_mar_selection_covariate_balance.csv notebooks/tutorials/doubleml/outputs/tables/08...
16 table 08_mar_treatment_selection_cells.csv notebooks/tutorials/doubleml/outputs/tables/08...
17 table 08_nonignorable_design_checks.csv notebooks/tutorials/doubleml/outputs/tables/08...
18 table 08_nonignorable_score_comparison.csv notebooks/tutorials/doubleml/outputs/tables/08...
19 table 08_package_versions.csv notebooks/tutorials/doubleml/outputs/tables/08...
20 table 08_selection_positivity_stress.csv notebooks/tutorials/doubleml/outputs/tables/08...
21 table 08_ssm_reporting_checklist.csv notebooks/tutorials/doubleml/outputs/tables/08...
22 figure 08_manual_mar_score_diagnostics.png notebooks/tutorials/doubleml/outputs/figures/0...
23 figure 08_mar_estimate_comparison.png notebooks/tutorials/doubleml/outputs/figures/0...
24 figure 08_mar_overlap_diagnostics.png notebooks/tutorials/doubleml/outputs/figures/0...
25 figure 08_mar_selection_and_outcome_visibility.png notebooks/tutorials/doubleml/outputs/figures/0...
26 figure 08_nonignorable_estimate_comparison.png notebooks/tutorials/doubleml/outputs/figures/0...
27 figure 08_selection_positivity_stress.png notebooks/tutorials/doubleml/outputs/figures/0...
28 figure 08_ssm_design_dag.png notebooks/tutorials/doubleml/outputs/figures/0...
29 report 08_ssm_report_template.md notebooks/tutorials/doubleml/outputs/reports/0...

The sample-selection notebook is complete. The next natural topic is regression discontinuity design, where identification comes from continuity around a cutoff rather than selection correction, instruments, or parallel trends.