This notebook teaches sample-selection models with DoubleML. The central problem is outcome attrition: we want a causal effect for a target population, but the outcome is observed only for a selected subset of rows.
This comes up constantly in applied data work. A user may only leave a rating if they engaged. A customer satisfaction survey may only be answered by a nonrandom subset. A downstream business outcome may only be measured after a user completes an earlier funnel step. If we estimate treatment effects only on the selected rows, the result can describe the selected sample rather than the full target population.
DoubleMLSSM handles this setting by combining treatment propensity learning, selection-probability learning, and outcome regression. The machine learning helps estimate nuisance functions flexibly; the causal validity still comes from assumptions about treatment assignment and outcome observation.
Learning Goals
By the end of this notebook, you should be able to:
identify when selected outcomes threaten causal effect estimation;
distinguish treatment selection from sample selection;
explain missing-at-random selection and nonignorable selection;
build DoubleMLSSMData with outcome, treatment, covariates, selection indicator, and optional selection instrument;
fit DoubleMLSSM with linear and tree-based nuisance learners;
inspect outcome, treatment, and selection nuisance predictions;
understand why unselected rows must remain in the dataset;
report selection assumptions, overlap concerns, and sensitivity risks clearly.
The Selection Problem
A standard binary-treatment ATE problem has an outcome Y, treatment D, and covariates X. Sample selection adds another variable, S:
S = 1 if the outcome is observed
S = 0 if the outcome is missing or not measured
The dangerous shortcut is to drop rows with S = 0 and run a treatment-effect model on selected rows only. That selected-row analysis answers a different question unless selection is completely unrelated to potential outcomes and treatment effects.
Sample selection is not the same as treatment selection. Treatment selection asks why D differs across rows. Sample selection asks why Y is observed for some rows and not others. In selected-outcome problems, we often need nuisance models for both.
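A toy simulation makes the danger concrete. The data-generating process below is invented for illustration: treatment is randomized, but the outcome is recorded only when it is high, so dropping S = 0 rows changes the population being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)          # randomized treatment
tau = 1.0                                  # true ATE in this toy design
y = x + tau * d + rng.normal(size=n)       # full outcome
s = (y > 0).astype(int)                    # outcome recorded only when it is high

# Selected-row-only comparison: conditioning on S = 1 compares
# truncated slices of the two outcome distributions.
naive = y[(d == 1) & (s == 1)].mean() - y[(d == 0) & (s == 1)].mean()
print(f"true ATE: {tau:.2f}  selected-row difference: {naive:.2f}")
```

Even with perfectly randomized treatment, the selected-row difference lands well away from the true effect, because selection here depends on the outcome itself.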
Missing-At-Random Selection
The missing-at-random version assumes that, after conditioning on treatment and observed covariates, whether the outcome is observed does not depend on the missing potential outcomes.
In words:
Outcome observation can depend on D and X,
but not on hidden outcome shocks once D and X are fixed.
This is a strong assumption, but it can be plausible in some settings. For example, if survey response depends on known user activity, tenure, device type, and treatment status, and those variables capture the relevant response process, a missing-at-random correction may be credible.
The key nuisance functions are:
g(d, X) = E[Y | S = 1, D = d, X], the outcome model among selected rows;
pi(d, X) = P(S = 1 | D = d, X), the selection-probability model;
m(X) = P(D = 1 | X), the treatment propensity.
DoubleML combines these through an orthogonal score so that nuisance estimation error has less first-order impact on the final treatment-effect estimate.
Nonignorable Selection
Sometimes outcome observation depends on hidden factors that also affect the outcome. This is nonignorable selection. A satisfaction survey may be more likely to be answered by unusually happy or unusually unhappy users even after conditioning on observed covariates. In that case, missing-at-random adjustment can still be biased.
DoubleMLSSM also supports a nonignorable score. That setup requires an instrument-like variable for selection: a variable that affects whether the outcome is observed but does not directly affect the outcome itself. In this notebook we call it selection_encouragement.
The exclusion restriction is doing real causal work. The package can use the variable, but it cannot prove that the variable affects selection only through observation and not through the outcome.
What DoubleML Adds
The sample-selection score is a doubly robust construction. For the missing-at-random case, the treated-arm contribution looks like:
D * S * (Y - g(1, X)) / [m(X) * pi(1, X)] + g(1, X)
The control arm has the analogous term with 1 - D and g(0, X). The ATE estimate is the average treatment contribution minus the average control contribution.
This structure explains the three nuisance roles:
the outcome model fills in expected outcomes;
the treatment propensity adjusts for treatment assignment;
the selection probability upweights observed outcomes to represent rows whose outcomes are missing.
Cross-fitting keeps nuisance predictions out of sample. Orthogonality makes the final estimate less sensitive to small nuisance-model errors, but it does not remove the need for selection assumptions and overlap.
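The score above can be computed directly once the nuisance predictions are in hand. The function below is a sketch of that arithmetic, not the package's internal implementation; all arrays are hypothetical nuisance predictions.

```python
import numpy as np

def mar_ssm_ate(y, d, s, g1, g0, m, pi1, pi0, clip=0.01):
    """Average the MAR sample-selection score to an ATE estimate.

    g1, g0   : outcome-model predictions g(1, X), g(0, X)
    m        : treatment propensity P(D=1 | X)
    pi1, pi0 : selection probabilities P(S=1 | D=1, X), P(S=1 | D=0, X)
    """
    # Clip probabilities away from 0 and 1 for stable inverse weights.
    m = np.clip(m, clip, 1 - clip)
    pi1 = np.clip(pi1, clip, 1 - clip)
    pi0 = np.clip(pi0, clip, 1 - clip)
    treat = d * s * (y - g1) / (m * pi1) + g1
    control = (1 - d) * s * (y - g0) / ((1 - m) * pi0) + g0
    return float(np.mean(treat - control))
```

With correct nuisance inputs, the averaged score recovers the full-population ATE even though Y is visible only where S = 1.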
Runtime Note
This notebook fits several sample-selection models, including stress tests and a nonignorable-selection example. A full run should take about 2 to 4 minutes on a typical laptop.
Setup
The setup cell imports the scientific Python stack, configures output folders, and suppresses only the narrow warnings that would clutter tutorial output. Code is kept visible throughout the notebook.
The version table is part of the reproducibility record. The installed version exposes DoubleMLSSM and DoubleMLSSMData, which are the two classes used in this tutorial.
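A version table can be produced with the standard library; the package list below is illustrative, not the notebook's exact cell.

```python
import importlib.metadata as md

packages = ["doubleml", "scikit-learn", "numpy", "pandas", "matplotlib"]
versions = {}
for pkg in packages:
    try:
        versions[pkg] = md.version(pkg)
    except md.PackageNotFoundError:
        versions[pkg] = "not installed"

# Print a simple reproducibility record.
for pkg, ver in versions.items():
    print(f"{pkg:>14}  {ver}")
```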
Helper Functions
The helpers below keep the notebook readable. They save tables, construct nuisance learners, extract DoubleML predictions, and compute compact diagnostics. None of them changes the causal estimand.
def save_table(df, file_name, index=False):
    """Save a table under the notebook output folder and return the DataFrame."""
    path = TABLE_DIR / file_name
    df.to_csv(path, index=index)
    return df


def sigmoid(x):
    """Logistic transform used for synthetic probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


def rmse(y_true, y_pred):
    """Root mean squared error with one-dimensional arrays."""
    return float(np.sqrt(mean_squared_error(np.asarray(y_true).ravel(), np.asarray(y_pred).ravel())))


def clip_probability(values, lower=0.01, upper=0.99):
    """Clip probabilities away from exact 0 and 1 for stable inverse weights."""
    return np.clip(np.asarray(values, dtype=float), lower, upper)


def predict_probability(model, X):
    """Return P(class=1) from a classifier."""
    return model.predict_proba(X)[:, 1]


def make_linear_learners():
    """Low-variance nuisance learners for mostly linear synthetic designs."""
    ml_g = make_pipeline(StandardScaler(), LinearRegression())
    ml_pi = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
    ml_m = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
    return ml_g, ml_pi, ml_m


def make_hgb_learners(seed=RANDOM_SEED, max_iter=80):
    """Tree-based nuisance learners for more flexible adjustment."""
    ml_g = HistGradientBoostingRegressor(
        max_iter=max_iter, max_leaf_nodes=15, min_samples_leaf=30,
        learning_rate=0.06, random_state=seed,
    )
    ml_pi = HistGradientBoostingClassifier(
        max_iter=max_iter, max_leaf_nodes=15, min_samples_leaf=30,
        learning_rate=0.06, random_state=seed + 1,
    )
    ml_m = HistGradientBoostingClassifier(
        max_iter=max_iter, max_leaf_nodes=15, min_samples_leaf=30,
        learning_rate=0.06, random_state=seed + 2,
    )
    return ml_g, ml_pi, ml_m


def prediction_vector(model, key):
    """Extract one DoubleML prediction array as a flat vector."""
    return np.asarray(model.predictions[key]).reshape(-1)


def summarize_doubleml(model, label, design, true_target):
    """Create a compact one-row summary from a fitted DoubleML object."""
    ci = model.confint(level=0.95).iloc[0]
    theta = float(model.coef[0])
    se = float(model.se[0])
    return {
        "estimator": label,
        "design": design,
        "theta_hat": theta,
        "std_error": se,
        "ci_95_lower": float(ci.iloc[0]),
        "ci_95_upper": float(ci.iloc[1]),
        "true_target": float(true_target),
        "bias_vs_target": theta - float(true_target),
    }


def learner_loss_table(model, model_label):
    """Convert DoubleML learner losses to a tidy table."""
    losses = model.evaluate_learners()
    rows = []
    for learner_name, values in losses.items():
        rows.append(
            {
                "model": model_label,
                "learner": learner_name,
                "metric_value": float(np.asarray(values).mean()),
                "metric": "RMSE for outcome learners; classification error style score for propensity learners",
            }
        )
    return pd.DataFrame(rows)


def add_ci_columns(df):
    """Prepare lower and upper error-bar columns for plotting."""
    out = df.copy()
    out["lower_error"] = out["theta_hat"] - out["ci_95_lower"]
    out["upper_error"] = out["ci_95_upper"] - out["theta_hat"]
    return out
The helper functions make the model code easier to read. The key idea to keep in mind is that SSM has three nuisance tasks: outcome regression, selection-probability modeling, and treatment-propensity modeling.
Teaching Diagram
The diagram separates treatment assignment from outcome observation. D determines which potential outcome is relevant. S determines whether we see that outcome. Both can depend on observed controls, and selection can also depend on treatment.
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

nodes = {
    "X": {"xy": (0.10, 0.54), "label": "Observed\ncontrols X", "color": "#dbeafe"},
    "D": {"xy": (0.34, 0.74), "label": "Treatment\nD", "color": "#dcfce7"},
    "S": {"xy": (0.34, 0.30), "label": "Outcome\nobserved S", "color": "#fef3c7"},
    "Y": {"xy": (0.64, 0.58), "label": "Outcome\nY", "color": "#fee2e2"},
    "Z": {"xy": (0.62, 0.22), "label": "Selection\nencouragement Z", "color": "#e0f2fe"},
    "U": {"xy": (0.60, 0.88), "label": "Hidden response\nfactor U", "color": "#f3f4f6"},
    "O": {"xy": (0.90, 0.52), "label": "Observed data\nS x Y", "color": "#ede9fe"},
}

fig, ax = plt.subplots(figsize=(12, 6))
ax.set_axis_off()
box_w, box_h = 0.15, 0.11
arrow_gap = 0.018


def anchor(node, side):
    x, y = nodes[node]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
        "upper_right": (box_w / 2, box_h * 0.25),
        "lower_right": (box_w / 2, -box_h * 0.25),
        "upper_left": (-box_w / 2, box_h * 0.25),
        "lower_left": (-box_w / 2, -box_h * 0.25),
    }
    dx, dy = offsets[side]
    return (x + dx, y + dy)


def shorten_segment(start, end, gap=arrow_gap):
    """Move arrow endpoints inward so arrowheads do not sit on top of boxes."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    delta = end - start
    length = np.hypot(delta[0], delta[1])
    if length == 0:
        return tuple(start), tuple(end)
    unit = delta / length
    return tuple(start + gap * unit), tuple(end - gap * unit)


def shorten_polyline(points, gap=arrow_gap):
    """Shorten only the first and final endpoints of a routed arrow."""
    pts = [tuple(point) for point in points]
    if len(pts) < 2:
        return pts
    pts[0], _ = shorten_segment(pts[0], pts[1], gap=gap)
    _, pts[-1] = shorten_segment(pts[-2], pts[-1], gap=gap)
    return pts


def draw_arrow(start, end, color, style="solid", rad=0.0, linewidth=1.7, gap=arrow_gap):
    start, end = shorten_segment(start, end, gap=gap)
    arrow = FancyArrowPatch(
        start, end, arrowstyle="-|>", mutation_scale=18, linewidth=linewidth,
        color=color, linestyle=style, shrinkA=0, shrinkB=0,
        connectionstyle=f"arc3,rad={rad}", zorder=5,
    )
    ax.add_patch(arrow)


def draw_routed_arrow(points, color, style="solid", linewidth=1.7, gap=arrow_gap):
    pts = shorten_polyline(points, gap=gap)
    for start, end in zip(pts[:-2], pts[1:-1]):
        ax.plot([start[0], end[0]], [start[1], end[1]], color=color,
                linestyle=style, linewidth=linewidth, zorder=5)
    draw_arrow(pts[-2], pts[-1], color=color, style=style, linewidth=linewidth, gap=0.0)


# Main observed paths.
draw_arrow(anchor("X", "upper_right"), anchor("D", "left"), color="#334155")
draw_arrow(anchor("X", "lower_right"), anchor("S", "left"), color="#334155")
draw_routed_arrow([anchor("X", "right"), (0.38, 0.56), anchor("Y", "left")], color="#334155")
draw_arrow(anchor("D", "right"), anchor("Y", "upper_left"), color="#15803d")
draw_arrow(anchor("D", "bottom"), anchor("S", "top"), color="#b45309")
draw_arrow(anchor("Z", "left"), anchor("S", "lower_right"), color="#0369a1")
draw_arrow(anchor("Y", "right"), anchor("O", "upper_left"), color="#334155")
draw_arrow(anchor("S", "right"), anchor("O", "lower_left"), color="#b45309")

# Dashed paths are separated so the nonignorable-selection risk is visible without crossing boxes.
draw_arrow(anchor("U", "lower_right"), anchor("Y", "top"), color="#6b7280", style="dashed", linewidth=1.5)
draw_routed_arrow([anchor("U", "lower_left"), (0.48, 0.66), anchor("S", "upper_right")],
                  color="#6b7280", style="dashed", linewidth=1.5)

for spec in nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2), box_w, box_h,
        boxstyle="round,pad=0.018", facecolor=spec["color"],
        edgecolor="#334155", linewidth=1.2, zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

ax.text(
    0.50, 0.08,
    "SSM estimates a treatment effect when Y is observed only for selected rows. "
    "The selection assumptions are part of the design.",
    ha="center", va="center", fontsize=10, color="#475569",
)
ax.set_title("Sample Selection Model Teaching Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_ssm_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()
The dashed paths show the nonignorable-selection risk. If a hidden response factor affects both outcome observation and the outcome itself, a missing-at-random correction can fail. The selection encouragement Z is useful only if it shifts observation without directly shifting the outcome.
Synthetic Missing-At-Random Data
The first dataset is a missing-at-random design. Selection depends on treatment and observed covariates, including a selection encouragement signal. There is no hidden response factor that directly affects both selection and the outcome.
Because this is synthetic data, we keep a full_outcome column for evaluation. In real selected-outcome data, that full outcome would not be available for unselected rows.
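The notebook's exact generator is not repeated here, but a minimal MAR-style generator with the same column roles looks like the sketch below. The coefficients and seed are invented; only the column names and the MAR structure (selection depends on D and X, never on hidden outcome shocks) mirror the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 6_000

engagement = rng.normal(size=n)
baseline = rng.normal(size=n)
encouragement = rng.normal(size=n)

# Treatment depends on observed covariates only.
p_treat = 1 / (1 + np.exp(-0.6 * engagement))
treated = rng.binomial(1, p_treat)

# Heterogeneous effect; the full outcome is kept only because data are synthetic.
true_tau = 1.0 + 0.3 * engagement
full_outcome = baseline + true_tau * treated + rng.normal(size=n)

# MAR selection: observation depends on D and observed X.
p_select = 1 / (1 + np.exp(-(0.4 * treated + 0.7 * encouragement)))
selected = rng.binomial(1, p_select)

mar_sketch = pd.DataFrame({
    "engagement_score": engagement,
    "baseline_value": baseline,
    "selection_encouragement": encouragement,
    "treated": treated,
    "selected": selected,
    "outcome_observed": np.where(selected == 1, full_outcome, 0.0),
    "full_outcome": full_outcome,
})
```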
Rows: 6,000
Selected share: 0.518
True MAR ATE: 1.0323
True MAR ATT: 1.1265
Saved data to: notebooks/tutorials/doubleml/outputs/datasets/08_synthetic_mar_selection_data.csv
row_id  engagement_score  baseline_value  support_need  prior_usage  mobile_user  selection_encouragement  treated  selected  outcome_observed  full_outcome  true_tau  true_treatment_propensity  true_selection_probability
0        0.304717         0.225117       -0.309788      0.636265    0             0.178007                1        0         0.000000          3.855320      1.140926  0.559546                   0.709133
1       -1.039984        -0.053343        0.060236      1.579594    1             0.177642                1        1         2.795688          2.795688      0.805979  0.370813                   0.578867
2        0.750451         0.864044       -1.240756     -0.668791    1             0.985953                1        1         6.691277          6.691277      1.521295  0.762673                   0.863314
3        0.940565         0.606733       -1.387754     -0.545453    0            -0.306807                1        1         6.255977          6.255977      1.464927  0.740005                   0.756832
4       -1.951035        -2.985823        0.182298     -0.208867    1             0.442759                1        0         0.000000          1.885728      0.126138  0.109799                   0.396394
The selected share is neither tiny nor near one, so the notebook has enough observed and unobserved rows to show the selection problem. The target for DoubleMLSSM is the ATE over the full population, not just the selected rows.
Field Dictionary
A field dictionary is especially important for selected-outcome problems because it prevents a common mistake: treating the selection indicator as an ordinary covariate or dropping unselected rows before the selection model is built.
mar_field_dictionary = pd.DataFrame(
    [
        {"field": "row_id", "role": "identifier", "description": "Unique row identifier."},
        {"field": "treated", "role": "treatment", "description": "Binary treatment indicator D."},
        {"field": "selected", "role": "selection indicator", "description": "Equals 1 if the outcome is observed."},
        {"field": "outcome_observed", "role": "observed outcome", "description": "Outcome column passed to DoubleMLSSM; unselected rows are filled with 0 and marked by selected=0."},
        {"field": "full_outcome", "role": "simulation truth", "description": "Outcome for every row, kept only because this is synthetic data."},
        {"field": "engagement_score", "role": "covariate", "description": "Observed pre-treatment feature related to treatment, selection, and outcome."},
        {"field": "baseline_value", "role": "covariate", "description": "Observed baseline outcome predictor."},
        {"field": "support_need", "role": "covariate", "description": "Observed need/risk feature."},
        {"field": "prior_usage", "role": "covariate", "description": "Observed prior activity signal."},
        {"field": "mobile_user", "role": "covariate", "description": "Observed binary segment marker."},
        {"field": "selection_encouragement", "role": "covariate in MAR design", "description": "Observed signal that predicts outcome observation."},
        {"field": "true_tau", "role": "simulation truth", "description": "Individual treatment effect."},
        {"field": "true_treatment_propensity", "role": "simulation truth", "description": "True P(D=1|X)."},
        {"field": "true_selection_probability", "role": "simulation truth", "description": "True P(S=1|D,X)."},
    ]
)
save_table(mar_field_dictionary, f"{NOTEBOOK_PREFIX}_mar_field_dictionary.csv")
display(mar_field_dictionary)
Notice that selected is not part of the ordinary feature list. It has a special role through s_col, because it determines which outcomes are observed.
Data Audit
Before fitting a sample-selection model, check selection rate, treatment rate, selected-by-treatment cells, and missingness. The selection cells matter because outcome models are trained only where S = 1 within treatment arms.
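The audit itself is a pair of crosstabs. The dataframe below is a small stand-in for the MAR dataset, built with invented rates so the block is self-contained:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
audit_df = pd.DataFrame({
    "treated": rng.binomial(1, 0.5, size=2_000),
    "selected": rng.binomial(1, 0.55, size=2_000),
})

# Cell counts: outcome models train only inside the selected cells of each arm.
cells = pd.crosstab(audit_df["treated"], audit_df["selected"], margins=True)
# Selection rate within each treatment arm.
rates = pd.crosstab(audit_df["treated"], audit_df["selected"], normalize="index")
print(cells)
print(rates.round(3))
```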
Every treatment-selection cell has enough rows. If selected treated rows or selected control rows were rare, the outcome nuisance functions g(1, X) or g(0, X) would be weak.
Selection Rates And Outcome Visibility
The next plot shows two different facts at once: selected shares differ by treatment status, and observed selected outcomes differ from the full outcome distribution. In real data, the full distribution would be hidden, which is exactly why the design assumption matters.
Treatment affects selection, and selected rows have a shifted outcome distribution. A selected-row-only analysis is therefore risky because it changes the population being analyzed.
Covariate Balance And Selection Balance
Selection can distort the covariate distribution. The table below compares selected and unselected rows on observed features. Large differences do not make correction impossible, but they make the selection model more important.
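A selected-versus-unselected mean comparison can be computed as below. The dataframe and coefficients are placeholders for the notebook's dataset; selection is made to depend on support_need so the two slices visibly differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 3_000
bal_df = pd.DataFrame({
    "engagement_score": rng.normal(size=n),
    "support_need": rng.normal(size=n),
})
# Selection depends on support_need, so the selected slice is shifted.
p_sel = 1 / (1 + np.exp(-bal_df["support_need"]))
bal_df["selected"] = rng.binomial(1, p_sel)

features = ["engagement_score", "support_need"]
balance = bal_df.groupby("selected")[features].mean().T
balance.columns = ["unselected_mean", "selected_mean"]
balance["difference"] = balance["selected_mean"] - balance["unselected_mean"]
print(balance.round(3))
```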
The selected sample is not a random slice of the full sample. The encouragement and support-need variables are especially important for modeling selection.
Naive Baselines
The first baseline uses only selected rows and compares treated versus control outcomes. The second baseline uses the synthetic full outcome and is shown only for teaching. In real selected-outcome data, the second baseline cannot be computed.
The selected-row difference is far from the true ATE because it mixes treatment assignment, treatment heterogeneity, and outcome observation. Covariate adjustment helps, but it still does not explicitly model the missing-outcome process for the full target population.
Manual Cross-Fitted MAR Score
Before fitting DoubleMLSSM, we compute a small manual missing-at-random score. This makes the mechanics visible.
The code trains:
g(1, X) on selected treated rows;
g(0, X) on selected control rows;
pi(D, X) on all rows to predict selection;
m(X) on all rows to predict treatment.
Then it combines residualized observed outcomes with outcome-model predictions.
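A condensed version of that manual procedure is sketched below. It follows the four training steps and the MAR score, but it is a teaching sketch, not the notebook's exact cell; column names match the dataset, learners are plain linear models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold


def manual_mar_ate(df, covariates, n_folds=5, clip=0.01, seed=0):
    """Cross-fitted MAR sample-selection ATE (a sketch of the mechanics)."""
    X = df[covariates].to_numpy()
    y = df["outcome_observed"].to_numpy()
    d = df["treated"].to_numpy()
    s = df["selected"].to_numpy()
    Xd = np.column_stack([X, d])
    psi = np.zeros(len(df))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        sel1 = train[(s[train] == 1) & (d[train] == 1)]   # selected treated rows
        sel0 = train[(s[train] == 1) & (d[train] == 0)]   # selected control rows
        g1 = LinearRegression().fit(X[sel1], y[sel1]).predict(X[test])
        g0 = LinearRegression().fit(X[sel0], y[sel0]).predict(X[test])
        m = LogisticRegression(max_iter=1_000).fit(X[train], d[train]).predict_proba(X[test])[:, 1]
        pi_fit = LogisticRegression(max_iter=1_000).fit(Xd[train], s[train])
        pi1 = pi_fit.predict_proba(np.column_stack([X[test], np.ones(len(test))]))[:, 1]
        pi0 = pi_fit.predict_proba(np.column_stack([X[test], np.zeros(len(test))]))[:, 1]
        m, pi1, pi0 = (np.clip(v, clip, 1 - clip) for v in (m, pi1, pi0))
        psi[test] = (d[test] * s[test] * (y[test] - g1) / (m * pi1) + g1
                     - (1 - d[test]) * s[test] * (y[test] - g0) / ((1 - m) * pi0) - g0)
    return float(psi.mean())
```

Note that every row, selected or not, contributes to the treatment and selection models and receives a score contribution; only the outcome regressions are restricted to S = 1.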
The manual score is much closer to the true ATE than the selected-row difference. That is the core lesson: selection correction uses unselected rows through the selection and treatment models, even though their outcomes are not observed.
Manual Score Diagnostics
The manual diagnostics check whether treatment and selection probabilities are extreme. The score contribution plot shows whether a few rows dominate the estimate.
The treatment and selection predictions are mostly away from 0 and 1. That matters because inverse-probability terms become unstable when predicted probabilities are too small.
DoubleML Sample-Selection Backend
DoubleMLSSMData is the data contract for sample-selection models. The selection column is passed as s_col. Unselected rows stay in the data because they are needed for the treatment and selection nuisance models.
This is the most important API cell in the notebook. If s_col is missing, the model no longer knows which outcomes are observed and the selected-outcome design is lost.
Fit DoubleMLSSM Under Missing At Random
We fit three versions:
linear nuisance learners;
gradient boosting nuisance learners;
gradient boosting with normalized inverse-probability weights.
The normalized version can be helpful when weights are noisy, but it is not a magic fix for poor overlap or wrong assumptions.
The DoubleML estimates are close to the true full-population ATE. The selected-row difference was much larger, which shows why the selection problem matters.
Estimate Comparison
The comparison plot puts the naive baselines, manual score, and DoubleML estimates on one axis. This is often the clearest way to communicate what selection correction changed.
The corrected estimates move toward the full-population ATE. The plot also shows that the ATT is a different target; selected-outcome correction does not remove the need to state the estimand.
Nuisance Learner Diagnostics
Now we inspect the nuisance learners. These diagnostics are not a proof of selection validity. They are checks that the prediction components are not obviously broken.
mar_loss_tables = []
for label, model in mar_models.items():
    mar_loss_tables.append(learner_loss_table(model, label))
mar_nuisance_losses = pd.concat(mar_loss_tables, ignore_index=True)
save_table(mar_nuisance_losses, f"{NOTEBOOK_PREFIX}_mar_nuisance_losses.csv")
display(mar_nuisance_losses)
model                   learner   metric_value
MAR linear              ml_g_d0   2.238000
MAR linear              ml_g_d1   2.837962
MAR linear              ml_pi     0.446393
MAR linear              ml_m      0.476925
MAR HGB                 ml_g_d0   2.252895
MAR HGB                 ml_g_d1   2.862023
MAR HGB                 ml_pi     0.448827
MAR HGB                 ml_m      0.480633
MAR HGB normalized IPW  ml_g_d0   2.245353
MAR HGB normalized IPW  ml_g_d1   2.863304
MAR HGB normalized IPW  ml_pi     0.449762
MAR HGB normalized IPW  ml_m      0.482784
(The metric column, identical for every row, reads: RMSE for outcome learners; classification error style score for propensity learners.)
The outcome nuisance losses are measured on the outcome scale. The propensity learner losses summarize classification-style performance. If these losses were very poor, the final estimate would deserve more skepticism.
Prediction Quality Against Synthetic Truth
Because the data are synthetic, we can compare the HGB nuisance predictions to true treatment and selection probabilities. In real data, this truth is unavailable, but the same plots can still reveal extreme probabilities.
The probability predictions track the synthetic truth reasonably well. This does not guarantee a causal estimate, but it confirms that the nuisance models are learning the intended observed structure.
Treatment And Selection Overlap
The next plot is one of the most important diagnostics in a sample-selection analysis. Both the treatment propensity and the selection probability appear in denominators. Near-zero probabilities can create unstable weights.
The predicted probabilities have usable overlap and are not concentrated at the clipping boundaries. That is a good sign for this synthetic example.
Bootstrap Confidence Interval
The multiplier bootstrap gives another uncertainty view for the preferred MAR model. It quantifies sampling uncertainty under the maintained design assumptions.
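The idea behind the multiplier bootstrap can be sketched directly on per-row score contributions. This is the concept, not the package's exact algorithm; the function and its defaults are illustrative.

```python
import numpy as np

def multiplier_bootstrap_ci(psi, n_rep=500, level=0.95, seed=0):
    """Multiplier-bootstrap CI from per-row score contributions (conceptual sketch)."""
    rng = np.random.default_rng(seed)
    n = len(psi)
    theta = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(n)
    centered = psi - theta
    # Gaussian multiplier weights perturb the centered scores;
    # each row of draws is one bootstrap replicate of theta* - theta.
    draws = rng.normal(size=(n_rep, n)) @ centered / n
    q = np.quantile(np.abs(draws / se), level)
    return theta - q * se, theta + q * se
```

The resampling happens over cheap multiplier weights, so no nuisance model is refit; that is what makes the multiplier bootstrap practical after cross-fitting.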
The bootstrap interval should be reported alongside the point estimate, but it does not account for misspecified selection assumptions. Selection bias is a design concern, not just a sampling concern.
Selection Positivity Stress Test
This stress test shifts the selection equation so outcomes become easier or harder to observe. Lower selection rates usually increase uncertainty because fewer outcomes are available for the outcome regressions and inverse-selection correction.
The standard error tends to rise when fewer outcomes are observed. This is the practical cost of selected outcomes: even when the assumptions are correct, less outcome visibility means less information.
Nonignorable Selection Data
The next dataset adds a hidden response factor U. This hidden factor affects both outcome observation and the outcome itself. A missing-at-random analysis is now misspecified because conditioning on observed covariates is not enough.
We also add selection_encouragement, a variable that affects selection but does not directly affect the outcome. In the synthetic design, this variable satisfies the exclusion restriction. In real work, that restriction would require a domain argument.
Rows: 6,000
Selected share: 0.509
True nonignorable-design ATE: 1.0389
Saved data to: notebooks/tutorials/doubleml/outputs/datasets/08_synthetic_nonignorable_selection_data.csv
row_id  engagement_score  baseline_value  support_need  prior_usage  mobile_user  selection_encouragement  treated  selected  outcome_observed  full_outcome  true_tau  true_treatment_propensity  true_selection_probability  hidden_response_factor
0       -0.311445        -0.577787       -0.069680      0.874368    0            -1.616559                0        0         0.000000          1.737582      0.842439  0.388739                   0.043021                    -2.126387
1       -2.245198         0.178591        0.631997     -1.302306    1            -0.500702                0        0         0.000000          1.281574      0.482289  0.217218                   0.075842                    -0.188155
2       -0.214792        -0.104288        0.117372     -0.859742    0             0.019919                1        1         4.283298          4.283298      0.918922  0.428041                   0.740645                     1.478050
3        0.886864         0.208647        0.431936     -0.321071    1            -0.362617                1        1         5.454790          5.454790      1.289819  0.625093                   0.550140                     0.170938
4        0.174338        -0.218863       -0.605183     -0.134723    1             1.157343                1        0         0.000000          4.405956      1.151273  0.569752                   0.851009                     0.108234
This design deliberately violates missing-at-random selection. The hidden factor affects both selection and the outcome. The selection encouragement is available as an instrument-like variable for the nonignorable score.
Nonignorable Design Checks
Before fitting the nonignorable model, check that the selection encouragement is relevant for selection and not mechanically related to treatment. In synthetic data, we know it has no direct outcome effect, but applied work would need a real argument.
The selection encouragement is relevant for selection. The hidden factor is also related to selection and the outcome, which is why the missing-at-random analysis is no longer conceptually right.
MAR Versus Nonignorable Scores
Now we fit two models to the nonignorable dataset:
a missing-at-random score that treats the selection process as observed-covariate explainable;
a nonignorable score that uses selection_encouragement as a selection instrument.
The nonignorable backend passes selection_encouragement through z_cols instead of treating it as an ordinary covariate.
The nonignorable score uses extra structure to address a hidden selection path. The improvement is not automatic in every sample, and it depends on the selection encouragement being a credible exclusion variable.
Nonignorable Estimate Plot
The plot below keeps the nonignorable comparison separate from the missing-at-random dataset because the data-generating process is different. The point is not to crown a universal best model; it is to show that the score must match the selection story.
The figure makes the design choice visible. When hidden response behavior matters, a missing-at-random score is answering under the wrong assumption, even if it runs cleanly.
Combined Summary Table
This table collects the main estimates from the notebook. The rows should be compared within their design setting, not as interchangeable models.
combined_ssm_summary = pd.concat(
    [
        baseline_summary.assign(section="MAR data baselines"),
        manual_mar_summary.assign(section="MAR data manual score"),
        mar_doubleml_summary.assign(section="MAR data DoubleML"),
        nonignorable_summary.assign(section="nonignorable data"),
    ],
    ignore_index=True,
    sort=False,
)
save_table(combined_ssm_summary, f"{NOTEBOOK_PREFIX}_combined_ssm_summary.csv")
display(combined_ssm_summary)
estimator                                             theta_hat  std_error  ci_95_lower  ci_95_upper  true_target  bias_vs_target  section                design
Selected-row treated-control difference               1.675867   NaN        NaN          NaN          1.032326      0.643540       MAR data baselines     NaN
Synthetic full-population treated-control difference  1.732946   NaN        NaN          NaN          1.032326      0.700620       MAR data baselines     NaN
Selected-row OLS with covariates                      1.065852   0.026952   1.013027     1.118678     1.032326      0.033526       MAR data baselines     NaN
Manual cross-fitted MAR SSM                           1.003513   0.030544   0.943647     1.063380     1.032326     -0.028813       MAR data manual score  NaN
MAR linear                                            1.001539   0.030564   0.941634     1.061444     1.032326     -0.030787       MAR data DoubleML      missing-at-random
MAR HGB                                               1.002785   0.036305   0.931628     1.073943     1.032326     -0.029541       MAR data DoubleML      missing-at-random
MAR HGB normalized IPW                                1.023998   0.016952   0.990773     1.057222     1.032326     -0.008329       MAR data DoubleML      missing-at-random
MAR score on nonignorable data                        1.027318   0.035579   0.957586     1.097051     1.038926     -0.011607       nonignorable data      nonignorable data
Nonignorable score with selection instrument          1.070113   0.044343   0.983202     1.157024     1.038926      0.031187       nonignorable data      nonignorable data
Nonignorable score HGB                                1.041098   0.046897   0.949182     1.133014     1.038926      0.002172       nonignorable data      nonignorable data
The combined table shows the notebook’s main story: selected-row-only estimates can be very misleading, missing-at-random correction works when its assumptions match the data, and nonignorable selection requires stronger structure.
Reporting Checklist
A sample-selection report needs more than an effect estimate. The checklist below names the assumptions and diagnostics a reader should see.
ssm_reporting_checklist = pd.DataFrame(
    [
        {"topic": "target population", "question": "Is the effect for all rows or selected rows only?", "notebook_answer": "The main target is the full-population ATE."},
        {"topic": "selection definition", "question": "What does S=1 mean?", "notebook_answer": "The outcome is observed."},
        {"topic": "unselected rows", "question": "Were unselected rows retained?", "notebook_answer": "Yes. They enter treatment and selection nuisance models."},
        {"topic": "MAR assumption", "question": "Can observed D and X explain outcome observation?", "notebook_answer": "True in the MAR synthetic design, false in the nonignorable design."},
        {"topic": "selection overlap", "question": "Are selection probabilities away from zero?", "notebook_answer": "Checked with selection-probability histograms and stress tests."},
        {"topic": "treatment overlap", "question": "Are treatment propensities away from zero and one?", "notebook_answer": "Checked with treatment-propensity histograms."},
        {"topic": "nonignorable selection", "question": "If MAR is not credible, is there a valid selection instrument?", "notebook_answer": "The tutorial uses a synthetic selection encouragement as z_cols."},
        {"topic": "uncertainty", "question": "Are intervals and diagnostics reported?", "notebook_answer": "Yes. Standard errors, bootstrap CI, and nuisance diagnostics are saved."},
    ]
)
save_table(ssm_reporting_checklist, f"{NOTEBOOK_PREFIX}_ssm_reporting_checklist.csv")
display(ssm_reporting_checklist)
| | topic | question | notebook_answer |
|---|---|---|---|
| 0 | target population | Is the effect for all rows or selected rows only? | The main target is the full-population ATE. |
| 1 | selection definition | What does S=1 mean? | The outcome is observed. |
| 2 | unselected rows | Were unselected rows retained? | Yes. They enter treatment and selection nuisance models. |
| 3 | MAR assumption | Can observed D and X explain outcome observation? | True in the MAR synthetic design, false in the nonignorable design. |
| 4 | selection overlap | Are selection probabilities away from zero? | Checked with selection-probability histograms and stress tests. |
| 5 | treatment overlap | Are treatment propensities away from zero and one? | Checked with treatment-propensity histograms. |
| 6 | nonignorable selection | If MAR is not credible, is there a valid selection instrument? | The tutorial uses a synthetic selection encouragement as z_cols. |
| 7 | uncertainty | Are intervals and diagnostics reported? | Yes. Standard errors, bootstrap CI, and nuisance diagnostics are saved. |
This checklist turns the notebook into a reusable analysis template. The estimate is only one part of the evidence; the selection story is equally important.
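Two of the checklist rows (treatment overlap, selection overlap) lend themselves to a small reusable helper. The sketch below is a generic diagnostic, not part of the notebook's saved code: `overlap_report` is a hypothetical function name, and the `floor` threshold is an ad hoc choice rather than a DoubleML setting.

```python
import numpy as np

def overlap_report(probs, label, floor=0.05):
    """Summarize fitted probabilities and flag mass near the boundary.

    `probs` can hold treatment propensities or selection probabilities;
    `floor` is an arbitrary warning threshold for near-zero values.
    """
    probs = np.asarray(probs, dtype=float)
    q = np.quantile(probs, [0.0, 0.01, 0.5, 0.99, 1.0])
    return {
        "label": label,
        "min": q[0], "p01": q[1], "median": q[2], "p99": q[3], "max": q[4],
        "share_below_floor": float((probs < floor).mean()),
    }

# Illustrative call on simulated selection probabilities.
rng = np.random.default_rng(1)
sim = 1.0 / (1.0 + np.exp(-rng.normal(0.5, 1.2, size=10_000)))
print(overlap_report(sim, "selection probability"))
```

In the notebook, the same call would be pointed at the fitted nuisance predictions for both the treatment propensity and the selection probability, and a nonzero `share_below_floor` would prompt the overlap discussion the checklist asks for.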
Report Template
The final report template writes the preferred MAR result and the nonignorable comparison in plain language. The language is cautious because sample selection assumptions are often the hardest part of the design.
best_mar = mar_doubleml_summary.loc[
    mar_doubleml_summary["estimator"] == "MAR HGB"
].iloc[0]
best_non = nonignorable_summary.loc[
    nonignorable_summary["estimator"] == "Nonignorable score with selection instrument"
].iloc[0]

report_text = f"""
# Sample Selection DoubleML Report Template

## Question

Estimate the treatment effect for the full target population when the outcome is observed only for selected rows.

## Preferred Missing-At-Random Result

The preferred MAR estimate uses `DoubleMLSSM` with gradient boosting nuisance learners.

- Estimate: {best_mar['theta_hat']:.4f}
- Standard error: {best_mar['std_error']:.4f}
- 95 percent CI: [{best_mar['ci_95_lower']:.4f}, {best_mar['ci_95_upper']:.4f}]
- Synthetic true ATE: {TRUE_MAR_ATE:.4f}

## Nonignorable Selection Demonstration

When selection depends on a hidden response factor, a MAR score is not conceptually sufficient. The nonignorable score uses a selection encouragement variable as an exclusion variable.

- Nonignorable estimate: {best_non['theta_hat']:.4f}
- Standard error: {best_non['std_error']:.4f}
- 95 percent CI: [{best_non['ci_95_lower']:.4f}, {best_non['ci_95_upper']:.4f}]
- Synthetic true ATE: {TRUE_NON_ATE:.4f}

## Assumptions To State

- Treatment assignment is unconfounded after conditioning on observed covariates.
- Outcome observation is missing at random after conditioning on treatment and covariates, unless using the nonignorable score.
- Treatment and selection probabilities have adequate overlap.
- For nonignorable selection, the selection encouragement affects outcome observation and has no direct outcome effect.

## Diagnostics To Include

- Selected versus unselected covariate balance.
- Treatment and selection probability overlap.
- Selected-row baseline estimates versus selection-adjusted estimates.
- Nuisance learner diagnostics.
- Stress test showing how lower outcome visibility affects uncertainty.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_ssm_report_template.md"
report_path.write_text(report_text)
print(report_text)
# Sample Selection DoubleML Report Template
## Question
Estimate the treatment effect for the full target population when the outcome is observed only for selected rows.
## Preferred Missing-At-Random Result
The preferred MAR estimate uses `DoubleMLSSM` with gradient boosting nuisance learners.
- Estimate: 1.0028
- Standard error: 0.0363
- 95 percent CI: [0.9316, 1.0739]
- Synthetic true ATE: 1.0323
## Nonignorable Selection Demonstration
When selection depends on a hidden response factor, a MAR score is not conceptually sufficient. The nonignorable score uses a selection encouragement variable as an exclusion variable.
- Nonignorable estimate: 1.0701
- Standard error: 0.0443
- 95 percent CI: [0.9832, 1.1570]
- Synthetic true ATE: 1.0389
## Assumptions To State
- Treatment assignment is unconfounded after conditioning on observed covariates.
- Outcome observation is missing at random after conditioning on treatment and covariates, unless using the nonignorable score.
- Treatment and selection probabilities have adequate overlap.
- For nonignorable selection, the selection encouragement affects outcome observation and has no direct outcome effect.
## Diagnostics To Include
- Selected versus unselected covariate balance.
- Treatment and selection probability overlap.
- Selected-row baseline estimates versus selection-adjusted estimates.
- Nuisance learner diagnostics.
- Stress test showing how lower outcome visibility affects uncertainty.
The report template keeps the estimate tied to its assumptions. That is especially important for selected-outcome data, where the missing outcomes are precisely the part we cannot inspect directly.
Artifact Manifest
The manifest lists every dataset, figure, table, and report produced by this notebook. This makes the tutorial easy to audit and reuse.
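The manifest-building code itself is not shown above. A minimal sketch of such a scanner follows; the `build_manifest` helper and the directory mapping are assumptions, and in the notebook the dict would map artifact kinds to its actual `DATA_DIR`, `FIGURE_DIR`, `TABLE_DIR`, and `REPORT_DIR` constants.

```python
import tempfile
from pathlib import Path
import pandas as pd

def build_manifest(dirs):
    """Collect every file in the given {kind: directory} mapping into a table."""
    rows = []
    for kind, d in dirs.items():
        for path in sorted(Path(d).glob("*")):
            if path.is_file():
                rows.append({"kind": kind, "file": path.name,
                             "size_bytes": path.stat().st_size})
    return pd.DataFrame(rows, columns=["kind", "file", "size_bytes"])

# Demonstration on a throwaway directory with one placeholder artifact.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "example_table.csv").write_text("a,b\n1,2\n")
manifest = build_manifest({"table": demo_dir})
print(manifest)
```

Saving this table alongside the other outputs gives an auditor a single entry point: every figure, table, and report the notebook claims to produce should appear as a row here.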
The sample-selection notebook is complete. The next natural topic is regression discontinuity design, where identification comes from continuity around a cutoff rather than selection correction, instruments, or parallel trends.