This notebook is a full tutorial on DoubleMLIIVM, the interactive instrumental-variable model for a binary treatment and a binary instrument.
The previous notebook handled binary treatments under observed-control adjustment. Here we add noncompliance and instrument-induced treatment changes. The treatment is binary, the instrument is binary, and the target is a local effect: the effect for units whose treatment status is shifted by the instrument.
The main point is conceptual as much as computational. Binary IV estimates are not ordinary ATE estimates. They are local effects tied to compliance behavior and instrument validity.
Learning Goals
By the end of this notebook, you should be able to:
Explain the IIVM setup for binary treatment and binary instrument data.
Distinguish ATE, ATT-style targets, and LATE.
Explain compliers, always-takers, never-takers, and defiers.
State the IV assumptions for binary instruments: relevance, exclusion, independence after controls, and monotonicity.
Understand the DoubleMLIIVM nuisance roles: ml_g, ml_m, and ml_r.
Manually compute the cross-fitted IIVM ratio score.
Fit DoubleMLIIVM and diagnose first-stage strength, instrument overlap, nuisance losses, score contributions, and split stability.
Explain why weak instruments, invalid instruments, or wrong subgroup assumptions change the meaning of the estimate.
From ATE To LATE
For a binary treatment, the ATE is:
\[
E[Y(1)-Y(0)].
\]
That target averages treatment effects over everyone. In a binary IV design with noncompliance, the instrument does not necessarily move everyone into treatment. Some units take treatment regardless of the instrument, some never take treatment, and some take treatment only when encouraged.
The local average treatment effect is the treatment effect for the instrument-responsive units:
\[
\text{LATE} = E[Y(1)-Y(0) \mid \text{complier}].
\]
When compliance probability varies with covariates, the model target is a compliance-weighted local effect. In the synthetic data we can compute that target exactly; in real data we usually cannot observe compliance type directly.
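One standard way to write this compliance-weighted target uses the complier indicator D(1) - D(0), which is 0 or 1 under monotonicity (a textbook identity, stated here for reference):
\[
\theta_0 = \frac{E\big[(D(1)-D(0))\,(Y(1)-Y(0))\big]}{E\big[D(1)-D(0)\big]} = E\big[Y(1)-Y(0) \mid D(1) > D(0)\big].
\]
When the complier probability varies with covariates, this ratio weights each covariate profile by how strongly the instrument moves treatment there.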
Compliance Types
With a binary instrument Z and binary treatment D, define two potential treatment states:
\[
D(1) \quad \text{and} \quad D(0).
\]
D(1) is whether the unit would take treatment if instrumented. D(0) is whether the unit would take treatment if not instrumented.
The common compliance types are:
Complier: D(1)=1, D(0)=0.
Always-taker: D(1)=1, D(0)=1.
Never-taker: D(1)=0, D(0)=0.
Defier: D(1)=0, D(0)=1.
The usual monotonicity assumption rules out defiers: the instrument should not make anyone less likely to take treatment.
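The four types can be made concrete with a tiny simulation. This is an illustrative sketch only; the type shares and variable names are assumptions, not the notebook's data-generating process:

```python
import numpy as np

# Draw latent types directly: 25% always-takers, 25% never-takers, 50% compliers.
rng = np.random.default_rng(0)
n = 10_000
u = rng.uniform(size=n)
d0 = (u < 0.25).astype(int)                   # always-takers have D(0) = 1
d1 = ((u < 0.25) | (u >= 0.50)).astype(int)   # always-takers and compliers have D(1) = 1

def compliance_type(d0, d1):
    """Map potential treatment states to the four textbook types."""
    out = np.full(d0.shape, "defier", dtype=object)
    out[(d0 == 0) & (d1 == 1)] = "complier"
    out[(d0 == 1) & (d1 == 1)] = "always_taker"
    out[(d0 == 0) & (d1 == 0)] = "never_taker"
    return out

types, counts = np.unique(compliance_type(d0, d1), return_counts=True)
shares = dict(zip(types, (counts / n).round(3)))
print(shares)  # roughly 0.25 / 0.50 / 0.25, and no defiers by construction
```

In real data this classification is impossible: each unit reveals only one of D(0) and D(1), which is exactly why the assumptions below carry the identification burden.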
IIVM Assumptions
A binary IV design needs several assumptions:
Relevance: the instrument changes treatment probability after adjusting for controls.
Exclusion: the instrument affects the outcome only through treatment.
Conditional independence: after adjusting for controls, the instrument is independent of potential outcomes and potential treatment states.
Monotonicity: the instrument does not push any unit away from treatment; there are no defiers.
Overlap for the instrument: every relevant covariate profile has a nonzero chance of both instrument states.
The data can help screen relevance and overlap. Exclusion, independence, and monotonicity require design evidence.
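A minimal sketch of the two data-supported screens, relevance and instrument overlap, might look like this (the synthetic data and logistic propensity model are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: instrument depends on X, treatment take-up depends on the instrument.
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
z = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))          # instrument depends on X
d = rng.binomial(1, np.clip(0.2 + 0.5 * z + 0.05 * X[:, 1], 0, 1))

# Relevance screen: raw first-stage gap in treatment take-up.
first_stage_gap = d[z == 1].mean() - d[z == 0].mean()

# Overlap screen: estimated instrument propensity should stay away from 0 and 1.
m_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
print(f"first-stage gap: {first_stage_gap:.3f}")
print(f"propensity range: [{m_hat.min():.3f}, {m_hat.max():.3f}]")
```

Exclusion, independence, and monotonicity have no such screen; they live or die on the design argument.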
This notebook fits several cross-fitted binary IV models, a manual IIVM estimator, subgroup comparisons, and a repeated-splitting check. On a typical laptop, the full run should take roughly two to five minutes.
The stress tests use synthetic oracle calculations where possible so that the conceptual lessons stay fast.
Setup
This cell prepares the notebook environment. It creates output folders, keeps matplotlib cache files inside the tutorial output folder, imports scientific Python libraries, and records package versions.
The path logic supports running from the repository root or directly from the tutorial folder.
The package table is saved as part of the reproducibility record. Binary IV workflows depend on classifier behavior, cross-fitting splits, and package defaults.
Helper Functions
The next cell defines utilities for saving tables, computing simple baselines, computing manual ratio scores, cross-fitting IIVM nuisance functions, and extracting DoubleML diagnostics.
The manual IIVM helper is the most important one. It mirrors the IIVM target: an estimated effect of the instrument on the outcome divided by an estimated effect of the instrument on treatment. The denominator is the first stage; if it is small, the estimate becomes unstable.
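The ratio idea fits in a few lines. This is a stripped-down Wald sketch on synthetic data, not the notebook's actual helper (which also handles controls and standard errors):

```python
import numpy as np

def wald_ratio(y, d, z):
    """Instrument effect on outcome divided by instrument effect on treatment."""
    numerator = y[z == 1].mean() - y[z == 0].mean()     # reduced form
    denominator = d[z == 1].mean() - d[z == 0].mean()   # first stage
    return numerator / denominator, numerator, denominator

# Toy check: 50% compliers, constant effect 2.0, non-compliers flip a coin.
rng = np.random.default_rng(2)
n = 20_000
z = rng.binomial(1, 0.5, size=n)
complier = rng.uniform(size=n) < 0.5
d = np.where(complier, z, rng.binomial(1, 0.5, size=n))
y = 2.0 * d + rng.normal(size=n)

theta, num, den = wald_ratio(y, d, z)
print(f"theta ≈ {theta:.2f} (truth 2.0), first stage ≈ {den:.2f} (truth 0.5)")
```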
Draw The Binary IV Design
The diagram below shows the core binary IV story.
The instrument Z shifts treatment D. Treatment affects outcome Y. Observed controls X may affect all three. Hidden factors can affect treatment and outcome, which is why observed-control IRM may fail. The IV design requires that the hidden factors do not also drive the instrument after conditioning on X, and that the instrument has no direct outcome path.
The diagram makes clear why this is not the same as the IRM notebook. The instrument creates a local source of treatment variation, and the estimate is tied to that variation.
Create A Teaching Dataset With Noncompliance
We now simulate a binary instrument and a binary treatment.
The instrument is an encouragement. Some units always take treatment, some never take treatment, and some comply with the encouragement. There are no defiers, so monotonicity holds by construction.
The outcome depends on the actual treatment, observed controls, and a latent factor related to compliance type. This makes direct treatment comparisons biased, but the instrument remains valid after conditioning on observed controls.
Saved synthetic IIVM data with shape (5000, 26)
Instrument rate: 0.463
Treatment rate: 0.392
True ATE: 1.021
True compliance-weighted LATE: 1.102
True realized complier effect: 1.098
True first stage: 0.448
| | engagement_score | need_intensity | content_fit | recent_activity | price_sensitivity | tenure_signal | novelty_appetite | seasonality_signal | encouragement | feature_exposure | … | true_r0 | true_r1 | true_g0 | true_g1 | true_mu0 | true_mu1 | true_tau | latent_compliance_factor | potential_d0 | potential_d1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.793122 | 0.240571 | -1.896326 | 1.395772 | 0.638295 | -0.292047 | -0.311949 | 0.303835 | 0 | 0 | … | 0.133788 | 0.543873 | -1.272362 | -0.926958 | -1.385047 | -0.542774 | 0.842273 | -0.077700 | 0 | 1 |
| 1 | -0.267660 | -0.225909 | 0.720068 | 0.514705 | -0.064128 | -0.085477 | 0.160916 | -0.614018 | 0 | 0 | … | 0.161699 | 0.599455 | 0.076353 | 0.541036 | -0.095292 | 0.966220 | 1.061512 | -0.137787 | 0 | 0 |
| 2 | -0.403750 | 0.548260 | -0.130483 | -1.374426 | -0.477279 | 0.656622 | -0.232283 | -0.148733 | 0 | 0 | … | 0.153179 | 0.690767 | 0.238566 | 0.834622 | 0.068727 | 1.177487 | 1.108760 | -0.009748 | 0 | 0 |
| 3 | 0.641837 | 1.824610 | -0.713189 | 1.348207 | -1.230013 | 0.174978 | -1.169530 | 1.351458 | 0 | 0 | … | 0.219630 | 0.797019 | 2.368826 | 3.295216 | 2.016441 | 3.620887 | 1.604447 | -0.259689 | 0 | 1 |
| 4 | 0.833923 | 1.137717 | -0.885533 | 0.684555 | -0.519013 | -0.457385 | 0.506537 | 0.876718 | 1 | 1 | … | 0.227780 | 0.884706 | 1.558963 | 2.568380 | 1.208961 | 2.745538 | 1.536577 | 0.222783 | 0 | 1 |

5 rows × 26 columns
The true ATE and true local effect differ because the instrument-responsive units have a different treatment-effect mix than the full population. That is the key conceptual difference between IRM and IIVM.
Field Dictionary
This table documents the roles of the important columns. The hidden teaching columns are not available in real applications; they are present only so the tutorial can compare estimates with known synthetic truth.
The instrument and treatment both have adequate support. The true first stage is large enough for a stable teaching example.
Compliance Type Mix
The next table summarizes the synthetic compliance types. In real data, we generally cannot label individuals this way because we observe only one instrument state per unit.
The table is useful for intuition: only compliers identify the local treatment effect under the standard monotonic IV story.
The mean true effect differs across compliance types. This is why the local effect can differ from the full-population ATE.
Visualize Compliance And Instrument Take-Up
The left panel shows treatment rates by instrument status. The right panel shows the true individual effects by hidden compliance type.
A binary IV design needs the instrument to create a treatment-rate gap. The effect distribution reminds us that the local target can differ from the population average.
from matplotlib.lines import Line2D

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

takeup = (
    iivm_df.groupby("encouragement")
    .agg(treatment_rate=("feature_exposure", "mean"), n=("feature_exposure", "size"))
    .reset_index()
)
sns.barplot(data=takeup, x="encouragement", y="treatment_rate", color="#60a5fa", ax=axes[0])
axes[0].set_title("Treatment Take-Up By Instrument Status")
axes[0].set_xlabel("Instrument Z")
axes[0].set_ylabel("Treatment Rate")
axes[0].set_ylim(0, 1)

# Seaborn's automatic hue legend can be overwritten after adding Matplotlib reference lines.
# We draw the KDE curves without an automatic legend, then build one combined legend manually.
compliance_order = ["complier", "always_taker", "never_taker"]
compliance_order = [group for group in compliance_order if group in iivm_df["compliance_type"].unique()]
compliance_palette = dict(zip(compliance_order, sns.color_palette("Set2", n_colors=len(compliance_order))))
sns.kdeplot(
    data=iivm_df,
    x="true_tau",
    hue="compliance_type",
    hue_order=compliance_order,
    palette=compliance_palette,
    common_norm=False,
    linewidth=2,
    legend=False,
    ax=axes[1],
)
axes[1].axvline(TRUE_ATE, color="#1d4ed8", linestyle="--", linewidth=1.5)
axes[1].axvline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5)
axes[1].set_title("Treatment Effect Heterogeneity By Compliance Type")
axes[1].set_xlabel("True Individual Effect")

legend_handles = [
    Line2D([0], [0], color=compliance_palette[group], linewidth=2, label=group.replace("_", " ").title())
    for group in compliance_order
]
legend_handles.extend(
    [
        Line2D([0], [0], color="#1d4ed8", linestyle="--", linewidth=1.5, label="True ATE"),
        Line2D([0], [0], color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE"),
    ]
)
axes[1].legend(handles=legend_handles, title="Curve / reference", frameon=True, loc="best")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_compliance_and_take_up.png", dpi=160, bbox_inches="tight")
plt.show()
The instrument creates a clear treatment-rate difference. The effect distributions show why the local effect is not just a technicality; it can be a meaningfully different causal target.
Instrument Overlap
The instrument propensity should not be too close to zero or one. If some covariate profiles almost always receive the same instrument status, there is little local comparison for those profiles.
This cell summarizes the true instrument propensity and displays its distribution by observed instrument status.
The instrument overlap is healthy in this simulation. In real data, poor instrument overlap can make the local effect unstable or limit the population for which the estimate is meaningful.
Correlation Screen
The correlation matrix gives a quick view of relationships among outcome, treatment, instrument, compliance probabilities, and controls.
Correlation is not identification. It is just a useful screen before fitting models.
The treatment is related to the latent compliance factor, which is why direct treatment comparisons are biased. The instrument is designed to be usable after adjustment for controls.
Baseline Estimators
We compare several simple estimators before fitting DoubleML:
Difference in means by treatment status.
OLS adjustment using treatment and observed controls.
Raw Wald ratio using instrument groups without controls.
Linear residualized Wald ratio.
Oracle IIVM ratio using the true nuisance functions.
The oracle estimator is available only in this synthetic notebook.
y = iivm_df["weekly_value"].to_numpy()
d = iivm_df["feature_exposure"].to_numpy()
z = iivm_df["encouragement"].to_numpy()
X = iivm_df[feature_cols]

baseline_rows = []
baseline_rows.append(
    mean_difference_summary(y, d, "Difference in means by treatment", target="direct treatment comparison")
)
baseline_rows.append(
    treatment_ols_summary(
        iivm_df["weekly_value"],
        iivm_df[["feature_exposure"] + feature_cols],
        "feature_exposure",
        "OLS with raw controls",
    )
)
baseline_rows.append(wald_summary(y, d, z, "Raw Wald ratio"))
baseline_rows.append(residualized_wald_summary(y, d, z, X, "Linear residualized Wald ratio"))
baseline_rows.append(
    iivm_ratio_summary(
        y=y,
        d=d,
        z=z,
        g0_hat=iivm_df["true_g0"].to_numpy(),
        g1_hat=iivm_df["true_g1"].to_numpy(),
        m_hat=iivm_df["true_instrument_propensity"].to_numpy(),
        r0_hat=iivm_df["true_r0"].to_numpy(),
        r1_hat=iivm_df["true_r1"].to_numpy(),
        label="Oracle IIVM ratio",
    )
)

baseline_estimates = pd.DataFrame(baseline_rows)
baseline_estimates["true_target"] = np.where(
    baseline_estimates["target"].str.contains("treatment", case=False, na=False),
    TRUE_ATE,
    TRUE_LATE,
)
baseline_estimates.loc[baseline_estimates["estimator"].eq("OLS with raw controls"), "true_target"] = TRUE_ATE
baseline_estimates["bias_vs_target"] = baseline_estimates["theta_hat"] - baseline_estimates["true_target"]
save_table(baseline_estimates, "baseline_estimates")
display(baseline_estimates)
| | estimator | target | theta_hat | std_error | ci_95_lower | ci_95_upper | p_value | wald_numerator | wald_denominator | iivm_numerator | iivm_denominator | true_target | bias_vs_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Difference in means by treatment | direct treatment comparison | 1.523229 | 0.040880 | 1.443103 | 1.603354 | NaN | NaN | NaN | NaN | NaN | 1.020535 | 0.502693 |
| 1 | OLS with raw controls | observed-control treatment slope | 1.223530 | 0.037648 | 1.149742 | 1.297318 | 1.082914e-231 | NaN | NaN | NaN | NaN | 1.020535 | 0.202994 |
| 2 | Raw Wald ratio | raw Wald LATE-style ratio | 1.849634 | 0.080037 | 1.692762 | 2.006506 | NaN | 0.924632 | 0.499900 | NaN | NaN | 1.101928 | 0.747706 |
| 3 | Linear residualized Wald ratio | linear residualized Wald ratio | 1.178991 | 0.078897 | 1.024353 | 1.333629 | NaN | 0.124390 | 0.105505 | NaN | NaN | 1.101928 | 0.077064 |
| 4 | Oracle IIVM ratio | LATE | 1.116161 | 0.064275 | 0.990182 | 1.242139 | NaN | NaN | NaN | 0.527789 | 0.472861 | 1.101928 | 0.014233 |
Direct treatment comparisons are biased because treatment selection is related to latent compliance factors. The raw Wald ratio is also off because the instrument depends on observed controls. Adjustment matters for the binary IV design.
Baseline Estimate Plot
The plot compares baseline estimates against two reference lines: the true ATE and the true local effect.
The local effect is the correct reference for IV-style rows. The ATE is the reference for direct treatment-comparison rows.
The figure shows why IIVM has its own notebook. Binary IV estimates answer a local instrument-induced question, not the same question as a binary-treatment IRM ATE.
Nuisance Learners
DoubleMLIIVM uses three nuisance learner roles:
ml_g: outcome regression E[Y | Z = z, X], fit for each instrument state; the outcome is continuous, so this is a regression task.
ml_m: instrument propensity P(Z = 1 | X), a classification task.
ml_r: treatment take-up P(D = 1 | Z = z, X), a classification task fit for each instrument state.
We compare a linear/logistic baseline with gradient boosting for all three roles.
The nuisance learners estimate the components of the ratio score. The final target remains the local treatment effect, not the predictive performance itself.
Manual Cross-Fitted IIVM
Before fitting DoubleML, we manually compute the cross-fitted IIVM ratio.
The numerator estimates how the instrument changes the outcome after adjusting for controls. The denominator estimates how the instrument changes treatment after adjusting for controls. Their ratio estimates the local treatment effect.
The manual cross-fitted ratio is the IIVM idea in plain Python. DoubleML automates the same structure with score management, inference, repeated splits, and diagnostics.
Manual Nuisance Quality
Since this is synthetic data, we can compare cross-fitted nuisance predictions to the true nuisance functions.
Real applications cannot do this. They should use out-of-fold loss, instrument overlap, first-stage diagnostics, and design evidence.
The nuisance predictions are useful but imperfect. The local ratio estimate is designed to be less sensitive to small nuisance errors, but weak nuisance quality can still create instability.
Manual First-Stage Diagnostics
The denominator of the IIVM ratio is the estimated instrument effect on treatment. This is the first stage.
If the denominator is close to zero, the local effect becomes unstable. This cell summarizes the manual cross-fitted first stage.
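A minimal version of such a summary could look like this (the array names are illustrative, not the notebook's exact objects):

```python
import numpy as np

def first_stage_summary(r1_hat, r0_hat):
    """Summarize the estimated complier-probability proxy r1_hat(X) - r0_hat(X)."""
    gap = np.asarray(r1_hat) - np.asarray(r0_hat)
    return {
        "mean_first_stage": float(gap.mean()),
        "share_negative": float((gap < 0).mean()),
        "q05": float(np.quantile(gap, 0.05)),
        "q95": float(np.quantile(gap, 0.95)),
    }

# Toy values: one covariate profile has an estimated negative first stage.
summary = first_stage_summary([0.7, 0.6, 0.5, 0.4], [0.2, 0.2, 0.1, 0.5])
print(summary)
```

A mean close to zero, or a large negative share, signals exactly the instability discussed above.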
Finished: Linear nuisance IIVM
Finished: Gradient boosting nuisance IIVM
| | estimator | treatment | theta_hat | std_error | t_stat | p_value | ci_95_lower | ci_95_upper | true_target | bias_vs_target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Linear nuisance IIVM | feature_exposure | 1.166017 | 0.080283 | 14.523824 | 8.559872e-48 | 1.008666 | 1.323369 | 1.101928 | 0.064090 |
| 1 | Gradient boosting nuisance IIVM | feature_exposure | 1.206022 | 0.081519 | 14.794455 | 1.590685e-49 | 1.046249 | 1.365795 | 1.101928 | 0.104094 |
The gradient-boosted model is expected to do better because the synthetic compliance and outcome functions are nonlinear. The linear/logistic model is still useful as a transparent baseline.
Compare All Estimators
This table combines direct treatment comparisons, Wald-style baselines, the manual cross-fitted IIVM estimate, and DoubleML estimates.
The key comparison is whether an estimator targets the full-population treatment effect or the local instrument-induced effect.
The adjusted local-effect estimators are much closer to the true local target than direct treatment comparisons. The table also shows why the ATE and local target should not be mixed casually.
Estimate Comparison Plot
The figure compares estimates with confidence intervals. The blue line marks the true ATE. The red line marks the true local effect.
IIVM rows should be judged against the red line, not the blue line.
The nuisance predictions are good enough for the teaching design, but not perfect. The first-stage nuisance functions are especially important because they form the denominator of the local-effect ratio.
DoubleML First-Stage Diagnostics
This cell summarizes the DoubleML-estimated first stage r1_hat(X) - r0_hat(X).
A strong positive first stage supports relevance and monotonicity-style behavior in the fitted nuisance functions.
The estimated first stage is positive on average. A large share of negative estimated first stages would be a warning sign for monotonicity, learner instability, or weak local treatment movement.
First-Stage And Propensity Plots
The left panel shows the estimated first-stage distribution. The right panel shows the estimated instrument propensity by observed instrument status.
Together these plots check relevance and instrument overlap.
The first-stage distribution is mostly positive and the instrument propensity has adequate overlap. These are necessary diagnostics, not sufficient proof of IV validity.
Visual Nuisance Diagnostics
The next figure compares DoubleML’s gradient-boosted nuisance predictions with true synthetic nuisance functions.
In real data, this truth comparison is not possible. The analogous workflow is to inspect out-of-fold losses, calibration, overlap, and first-stage stability.
The nuisance plots show the model captures the main patterns but smooths some extremes. That is normal for finite-sample machine-learning nuisance estimation.
Score Contributions
DoubleML stores orthogonal score contributions in psi. Large tails can signal influential observations, unstable instrument propensities, or weak first-stage regions.
This cell summarizes and plots score contributions from the main gradient-boosted IIVM fit.
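A simple tail check on score contributions might look like this (the psi array below is synthetic; in a fitted DoubleML model the per-observation scores are the psi attribute mentioned above):

```python
import numpy as np

def score_tail_report(psi, threshold=3.0):
    """Flag observations whose standardized score contribution is extreme."""
    psi = np.asarray(psi, dtype=float).ravel()
    standardized = (psi - psi.mean()) / psi.std()
    flagged = np.abs(standardized) > threshold
    return {
        "n_flagged": int(flagged.sum()),
        "max_abs_standardized": float(np.abs(standardized).max()),
    }

# Toy scores with three injected influential observations.
rng = np.random.default_rng(4)
psi = rng.normal(size=5000)
psi[:3] += 12.0
report = score_tail_report(psi)
print(report)
```

A handful of flags beyond the threshold is expected under Gaussian-like scores; a heavy cluster of extreme contributions is the warning sign.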
The bootstrap interval quantifies sampling uncertainty for the local effect under the fitted design. It should be reported with first-stage and instrument-validity discussion.
Repeated Sample Splitting
Cross-fitted estimates can move with different fold splits. Repeated sample splitting checks numerical stability.
This cell uses a slightly lighter gradient-boosting configuration so the check remains practical.
The estimate is stable across repeated splits in this synthetic design. That supports numerical reliability, not instrument validity by itself.
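The idea behind the stability check can be sketched with a toy estimator; the fold-averaged Wald ratio here is a stand-in assumption, not the notebook's DoubleML refit:

```python
import numpy as np
from sklearn.model_selection import KFold

# Fixed toy data: 50% compliers, constant effect 1.0.
rng = np.random.default_rng(6)
n = 5000
z = rng.binomial(1, 0.5, size=n)
d = np.where(rng.uniform(size=n) < 0.5, z, rng.binomial(1, 0.5, size=n))
y = 1.0 * d + rng.normal(size=n)

def fold_averaged_wald(y, d, z, seed):
    """Average a fold-wise Wald ratio over one 5-fold partition."""
    thetas = []
    for _, test in KFold(5, shuffle=True, random_state=seed).split(y):
        yt, dt, zt = y[test], d[test], z[test]
        thetas.append(
            (yt[zt == 1].mean() - yt[zt == 0].mean()) / (dt[zt == 1].mean() - dt[zt == 0].mean())
        )
    return float(np.mean(thetas))

# Same data, ten different fold partitions: only split noise should remain.
estimates = [fold_averaged_wald(y, d, z, seed) for seed in range(10)]
print(f"across 10 splits: mean = {np.mean(estimates):.3f}, sd = {np.std(estimates):.4f}")
```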
Subgroup Assumption Variants
DoubleMLIIVM has a subgroups argument for designs where always-takers or never-takers are known to be absent.
Our synthetic data contains both always-takers and never-takers. The default setting is therefore the right one. The next cell fits alternative subgroup restrictions to show that incorrect compliance restrictions can change the estimate.
subgroup_specs = {
    "Default: always and never possible": None,
    "Assume no always-takers": {"always_takers": False, "never_takers": True},
    "Assume no never-takers": {"always_takers": True, "never_takers": False},
    "Assume only compliers": {"always_takers": False, "never_takers": False},
}

subgroup_rows = []
for label, subgroup_config in subgroup_specs.items():
    model = DoubleMLIIVM(
        dml_data,
        ml_g=clone(repeated_outcome_learner),
        ml_m=clone(repeated_instrument_learner),
        ml_r=clone(repeated_treatment_learner),
        n_folds=5,
        n_rep=1,
        score="LATE",
        subgroups=subgroup_config,
    )
    model.fit(store_predictions=False)
    row = model.summary.reset_index().rename(columns={"index": "treatment"}).iloc[0]
    subgroup_rows.append(
        {
            "subgroup_assumption": label,
            "theta_hat": row["coef"],
            "std_error": row["std err"],
            "ci_95_lower": row["2.5 %"],
            "ci_95_upper": row["97.5 %"],
            "true_late": TRUE_LATE,
            "bias_vs_true_late": row["coef"] - TRUE_LATE,
        }
    )

subgroup_comparison = pd.DataFrame(subgroup_rows)
save_table(subgroup_comparison, "subgroup_assumption_comparison")
display(subgroup_comparison)
| | subgroup_assumption | theta_hat | std_error | ci_95_lower | ci_95_upper | true_late | bias_vs_true_late |
|---|---|---|---|---|---|---|---|
| 0 | Default: always and never possible | 1.145934 | 0.077302 | 0.994425 | 1.297442 | 1.101928 | 0.044006 |
| 1 | Assume no always-takers | 1.148685 | 0.078316 | 0.995190 | 1.302181 | 1.101928 | 0.046758 |
| 2 | Assume no never-takers | 1.170313 | 0.083855 | 1.005961 | 1.334666 | 1.101928 | 0.068385 |
| 3 | Assume only compliers | 1.214532 | 0.085916 | 1.046140 | 1.382925 | 1.101928 | 0.112605 |
The subgroup variants are not interchangeable. Use them only when the design genuinely rules out specific compliance types.
First-Stage Stress Test
A weak instrument creates a small denominator in the IIVM ratio. Even with perfect nuisance functions, weak first stages inflate uncertainty and can make estimates unstable.
This synthetic stress test changes only the complier share and uses oracle nuisance functions so the first-stage issue is isolated.
The first stage is the denominator of the estimate. When it gets small, a local-effect estimate can become noisy even if the instrument is valid.
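The instability can be demonstrated with a pure simulation; the numbers here are illustrative assumptions, separate from the notebook's oracle stress test:

```python
import numpy as np

# Ratio of a noisy numerator to a noisy denominator: the spread of the
# ratio explodes as the first stage (complier share) shrinks toward zero.
rng = np.random.default_rng(5)
true_effect, n_sims, n = 1.0, 500, 2000

spread = {}
for complier_share in (0.5, 0.2, 0.05):
    num = true_effect * complier_share + rng.normal(0, 1 / np.sqrt(n), size=n_sims)
    den = complier_share + rng.normal(0, 1 / np.sqrt(n), size=n_sims)
    spread[complier_share] = float(np.std(num / den))

for share, sd in spread.items():
    print(f"complier share {share:.2f}: sd of ratio estimate = {sd:.3f}")
```

With a 5 percent complier share the denominator occasionally lands near zero, so the ratio's spread is far larger than the first-stage noise alone would suggest.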
Exclusion-Violation Stress Test
Now we break the exclusion restriction by adding a direct effect of the instrument to the outcome.
The calculation uses oracle nuisance functions so the design violation is isolated. If the instrument affects the outcome directly, the IV ratio no longer recovers the treatment effect for compliers.
The estimate moves as the direct instrument effect grows. This is the central IV warning: a strong first stage does not protect against a bad exclusion restriction.
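For the Wald ratio the bias has a simple closed form. A population-level illustration with assumed numbers (no sampling noise):

```python
# If the instrument adds a direct outcome effect gamma, the reduced form
# becomes tau * first_stage + gamma, so the ratio converges to
# tau + gamma / first_stage rather than tau.
tau, first_stage = 1.0, 0.4
wald_limits = {gamma: (tau * first_stage + gamma) / first_stage for gamma in (0.0, 0.1, 0.2)}
for gamma, limit in wald_limits.items():
    print(f"gamma = {gamma:.1f}: Wald limit = {limit:.2f} (bias {limit - tau:+.2f})")
```

Note the amplification: the weaker the first stage, the more a small exclusion violation is magnified.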
Exclusion-Violation Plot
The next figure visualizes how direct instrument effects bias the local-effect estimate.
fig, ax = plt.subplots(figsize=(8.5, 5))
sns.lineplot(
    data=exclusion_violation_table,
    x="direct_instrument_effect",
    y="theta_hat_oracle_iivm",
    marker="o",
    color="#b45309",
    ax=ax,
)
ax.fill_between(
    exclusion_violation_table["direct_instrument_effect"],
    exclusion_violation_table["ci_95_lower"],
    exclusion_violation_table["ci_95_upper"],
    color="#f59e0b",
    alpha=0.18,
)
ax.axhline(TRUE_LATE, color="#b91c1c", linestyle="--", linewidth=1.5, label="True LATE without direct effect")
ax.set_title("Bias From Exclusion Restriction Violations")
ax.set_xlabel("Direct Effect Of Instrument On Outcome")
ax.set_ylabel("Estimated Local Effect")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_exclusion_violation_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()
This plot shows why instrument design has to be argued carefully. The model can estimate a ratio, but the ratio is causal only under the IV assumptions.
IIVM Compared With IRM
A tempting mistake is to fit a binary-treatment IRM model and treat it as equivalent to IIVM. They target different quantities.
This cell fits a gradient-boosted IRM ATE on the same observed treatment and controls. Because treatment is confounded by compliance-related latent factors, the IRM ATE is not the right design here.
irm_comparison = DoubleMLIRM(
    DoubleMLData(
        iivm_df[["weekly_value", "feature_exposure"] + feature_cols],
        y_col="weekly_value",
        d_cols="feature_exposure",
        x_cols=feature_cols,
    ),
    ml_g=clone(hgb_outcome_learner),
    ml_m=clone(hgb_treatment_learner),
    n_folds=5,
    n_rep=1,
    score="ATE",
)
irm_comparison.fit(store_predictions=True)

irm_vs_iivm = pd.concat(
    [
        irm_comparison.summary.reset_index()
        .rename(columns={"index": "treatment"})
        .assign(model="IRM ATE on observed treatment", target_reference=TRUE_ATE),
        iivm_hgb.summary.reset_index()
        .rename(columns={"index": "treatment"})
        .assign(model="IIVM LATE with instrument", target_reference=TRUE_LATE),
    ],
    ignore_index=True,
)
irm_vs_iivm["bias_vs_reference"] = irm_vs_iivm["coef"] - irm_vs_iivm["target_reference"]
save_table(irm_vs_iivm, "irm_vs_iivm_comparison")
display(irm_vs_iivm)
| | treatment | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % | model | target_reference | bias_vs_reference |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | feature_exposure | 1.192215 | 0.039338 | 30.307254 | 9.199046e-202 | 1.115115 | 1.269315 | IRM ATE on observed treatment | 1.020535 | 0.171679 |
| 1 | feature_exposure | 1.206022 | 0.081519 | 14.794455 | 1.590685e-49 | 1.046249 | 1.365795 | IIVM LATE with instrument | 1.101928 | 0.104094 |
The IRM and IIVM estimates answer different questions. In this simulation, the IIVM design is the appropriate response to noncompliance and latent treatment selection.
When IIVM Is The Right Or Wrong Tool
IIVM is useful when:
treatment is binary;
instrument is binary;
noncompliance is central;
the instrument plausibly shifts treatment but does not directly affect outcome;
the desired target is a local effect for instrument-responsive units.
IIVM is not enough when:
the instrument is weak;
the instrument directly affects the outcome;
defiers are plausible and monotonicity is not defensible;
the desired target is the full-population ATE;
treatment or instrument are continuous rather than binary.
Reporting Checklist
A useful IIVM report should make the local-effect meaning impossible to miss.
The checklist below captures the design and diagnostics needed for binary IV reporting.
reporting_checklist = pd.DataFrame(
    [
        {"item": "Causal question", "status": "Estimate the local effect of feature_exposure on weekly_value for instrument-responsive units."},
        {"item": "Treatment and instrument", "status": "Both feature_exposure and encouragement are binary."},
        {"item": "Target", "status": "LATE, not full-population ATE."},
        {"item": "Instrument relevance", "status": "First-stage take-up diagnostics reported."},
        {"item": "Instrument overlap", "status": "Instrument propensity diagnostics reported."},
        {"item": "Exclusion", "status": "Valid by construction in the synthetic data; requires design evidence in real data."},
        {"item": "Conditional independence", "status": "Valid by construction after X in the synthetic data; requires design evidence in real data."},
        {"item": "Monotonicity", "status": "No defiers by construction; requires design evidence in real data."},
        {"item": "Nuisance learners", "status": "Compared linear/logistic and gradient-boosted nuisance sets."},
        {"item": "Cross-fitting", "status": "Manual ratio and DoubleMLIIVM use cross-fitted nuisance predictions."},
        {"item": "Uncertainty", "status": "Standard errors, confidence intervals, bootstrap interval, and split stability included."},
        {"item": "Stress tests", "status": "Weak first-stage and exclusion-violation stress tests included."},
    ]
)
save_table(reporting_checklist, "iivm_reporting_checklist")
display(reporting_checklist)
| | item | status |
|---|---|---|
| 0 | Causal question | Estimate the local effect of feature_exposure on weekly_value for instrument-responsive units. |
| 1 | Treatment and instrument | Both feature_exposure and encouragement are binary. |
| 2 | Target | LATE, not full-population ATE. |
| 3 | Instrument relevance | First-stage take-up diagnostics reported. |
| 4 | Instrument overlap | Instrument propensity diagnostics reported. |
| 5 | Exclusion | Valid by construction in the synthetic data; requires design evidence in real data. |
| 6 | Conditional independence | Valid by construction after X in the synthetic data; requires design evidence in real data. |
| 7 | Monotonicity | No defiers by construction; requires design evidence in real data. |
| 8 | Nuisance learners | Compared linear/logistic and gradient-boosted nuisance sets. |
| 9 | Cross-fitting | Manual ratio and DoubleMLIIVM use cross-fitted nuisance predictions. |
| 10 | Uncertainty | Standard errors, confidence intervals, bootstrap interval, and split stability included. |
| 11 | Stress tests | Weak first-stage and exclusion-violation stress tests included. |
The checklist separates estimable diagnostics from assumptions. That separation matters even more for IV than for ordinary observed-control designs.
Report Template
The next cell writes a short markdown report template using the main gradient-boosted IIVM estimate.
This template keeps the local-effect target, first stage, instrument assumptions, and limitations close to the numeric estimate.
main_row = iivm_summary.loc[iivm_summary["estimator"] == "Gradient boosting nuisance IIVM"].iloc[0]
first_stage_mean = dml_first_stage_summary.loc[
    dml_first_stage_summary["diagnostic"] == "mean first stage r1_hat - r0_hat", "value"
].iloc[0]

report_text = f"""# IIVM Local Effect Estimate Report Template

## Causal Question
Estimate the local effect of `feature_exposure` on `weekly_value` for units whose treatment status is shifted by `encouragement`.

## Target
The target is LATE, not the full-population ATE. The estimate applies to the instrument-responsive margin under the IV assumptions.

## Main Estimate
- Estimated local effect: {main_row['theta_hat']:.4f}
- Standard error: {main_row['std_error']:.4f}
- 95 percent confidence interval: [{main_row['ci_95_lower']:.4f}, {main_row['ci_95_upper']:.4f}]

## First Stage
- Mean estimated first stage: {first_stage_mean:.4f}

## Estimator
The main estimator is `DoubleMLIIVM` with five-fold cross-fitting, histogram gradient-boosted outcome nuisance models, and histogram gradient-boosted classifiers for instrument and treatment take-up nuisance models.

## Diagnostics Included
- Direct treatment comparisons, raw Wald, residualized Wald, oracle ratio, manual cross-fitted IIVM, and DoubleML IIVM comparisons.
- Compliance-type summary for the synthetic teaching data.
- Instrument overlap diagnostics.
- First-stage distribution diagnostics.
- Nuisance learner losses and prediction checks.
- Orthogonal score contribution checks.
- Bootstrap confidence interval.
- Repeated sample-splitting stability.
- Subgroup assumption variants.
- Weak first-stage and exclusion-violation stress tests.

## Required Assumptions
The local effect is causal only if the instrument is relevant, has no direct effect on the outcome, is conditionally independent after controls, satisfies monotonicity, and has adequate overlap. DoubleML estimates the score under these assumptions; it does not establish them.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_iivm_report_template.md"
report_path.write_text(report_text)
print(report_text)
# IIVM Local Effect Estimate Report Template
## Causal Question
Estimate the local effect of `feature_exposure` on `weekly_value` for units whose treatment status is shifted by `encouragement`.
## Target
The target is LATE, not the full-population ATE. The estimate applies to the instrument-responsive margin under the IV assumptions.
## Main Estimate
- Estimated local effect: 1.2060
- Standard error: 0.0815
- 95 percent confidence interval: [1.0462, 1.3658]
## First Stage
- Mean estimated first stage: 0.5091
## Estimator
The main estimator is `DoubleMLIIVM` with five-fold cross-fitting, histogram gradient-boosted outcome nuisance models, and histogram gradient-boosted classifiers for instrument and treatment take-up nuisance models.
## Diagnostics Included
- Direct treatment comparisons, raw Wald, residualized Wald, oracle ratio, manual cross-fitted IIVM, and DoubleML IIVM comparisons.
- Compliance-type summary for the synthetic teaching data.
- Instrument overlap diagnostics.
- First-stage distribution diagnostics.
- Nuisance learner losses and prediction checks.
- Orthogonal score contribution checks.
- Bootstrap confidence interval.
- Repeated sample-splitting stability.
- Subgroup assumption variants.
- Weak first-stage and exclusion-violation stress tests.
## Required Assumptions
The local effect is causal only if the instrument is relevant, has no direct effect on the outcome, is conditionally independent after controls, satisfies monotonicity, and has adequate overlap. DoubleML estimates the score under these assumptions; it does not establish them.
The report template is deliberately explicit about local-effect meaning. That is the most common place where binary IV results get overclaimed.
Artifact Manifest
The final cell lists every artifact produced by this notebook so the outputs are easy to find later.
The IIVM notebook is complete. The next natural topic is difference-in-differences, where the identifying assumption shifts from observed-control adjustment or instruments to parallel trends.