DoubleML Tutorial 13: Sensitivity Analysis For Unobserved Confounding
This notebook is about a hard truth in causal inference: good machine learning does not remove the need for good identification assumptions. DoubleML can adjust flexibly for observed controls, but it still relies on a version of unconfoundedness given the observed controls. If an important common cause of treatment and outcome is missing, the estimate can be biased.
Sensitivity analysis asks a disciplined question:
How strong would hidden confounding need to be to materially change the conclusion?
DoubleML’s sensitivity tools do not discover the hidden variable. They do not prove that unobserved confounding is absent. Instead, they create a structured stress test. The stress test is controlled by three main ideas:
cf_y: how much residual outcome variation a hidden confounder could explain;
cf_d: how much the hidden confounder could change the treatment-side score representation;
rho: how adversarially the outcome-side and treatment-side hidden components are aligned.
The most conservative setting is usually rho = 1.0, meaning the hidden confounding works in the direction that most weakens the reported conclusion.
This notebook uses a synthetic PLR design where we know the hidden confounder. We fit three models:
observed controls only,
observed controls plus imperfect proxy controls,
an oracle model that includes the hidden confounder.
Only the first two are realistic. The oracle model exists so students can see what hidden confounding is doing. After fitting the realistic proxy-adjusted model, we run DoubleML sensitivity analysis, calibrate the sensitivity parameters using observed proxy benchmarks, and build reporting language that does not overstate what sensitivity analysis can prove.
Expected runtime: usually under one minute. Most cells are fast; the benchmark cell refits a few DoubleML models to calibrate observed-variable benchmarks.
Setup
This cell prepares output folders, imports DoubleML and plotting libraries, and suppresses only known notebook-environment noise. The rest of the notebook keeps code visible by default so the sensitivity workflow can be inspected line by line.
from pathlib import Path
import os
import warnings

PROJECT_ROOT = Path.cwd().resolve()
if PROJECT_ROOT.name == "doubleml":
    PROJECT_ROOT = PROJECT_ROOT.parents[2]
OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"
for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
from matplotlib.ticker import PercentFormatter

import doubleml as dml
from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.base import clone
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

NOTEBOOK_PREFIX = "13"
RANDOM_SEED = 100
TRUE_THETA = 0.80
TREATMENT_COL = "treatment"
OUTCOME_COL = "outcome"

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams.update({"figure.dpi": 120, "savefig.dpi": 160})

print(f"DoubleML version: {dml.__version__}")
print(f"Output directory: {OUTPUT_DIR}")
The setup confirms that the notebook is using the local DoubleML installation and the shared tutorial output folder. All saved artifacts in this notebook use prefix 13.
Helper Functions
These helpers keep the notebook focused on sensitivity logic. The most important helper is extract_sensitivity_row(), which turns DoubleML’s nested sensitivity_params dictionary into a tidy row.
DoubleML stores robustness values as fractions. The helper records both the fraction and percentage form because the text summary prints percentages.
The helper functions make three choices explicit: which controls are observed, whether proxy variables are included, and whether the hidden confounder is used in an oracle-only model.
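The helper bodies themselves are not reproduced in this export. As a rough, hedged sketch of what extract_sensitivity_row() might look like: the function below flattens the nested sensitivity_params dictionary that recent DoubleML releases expose after sensitivity_analysis(). The exact key layout assumed here ('theta' and 'ci' sub-dicts with 'lower'/'upper' arrays, plus 'rv' and 'rva' arrays stored as fractions) is an assumption about that API, and the field names in the returned row are illustrative.

```python
def extract_sensitivity_row(model, scenario, null_hypothesis):
    """Flatten one DoubleML sensitivity result into a tidy dict.

    Assumes the nested layout of ``model.sensitivity_params`` used by
    recent DoubleML releases: 'theta' and 'ci' sub-dicts holding
    'lower'/'upper' arrays, plus 'rv' and 'rva' arrays (fractions).
    """
    params = model.sensitivity_params
    rv = float(params["rv"][0])
    rva = float(params["rva"][0])
    return {
        "scenario": scenario,
        "null_hypothesis": null_hypothesis,
        "theta_lower": float(params["theta"]["lower"][0]),
        "theta_upper": float(params["theta"]["upper"][0]),
        "ci_lower": float(params["ci"]["lower"][0]),
        "ci_upper": float(params["ci"]["upper"][0]),
        "rv": rv,
        "rva": rva,
        # Record percentage forms too, since the text summary prints percentages.
        "rv_percent": 100.0 * rv,
        "rva_percent": 100.0 * rva,
    }
```

A tidy row per scenario makes it trivial to stack many scenarios into a single DataFrame for the grid and benchmark sections later in the notebook.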
Sensitivity Vocabulary
This table defines the main quantities used in DoubleML sensitivity analysis. The names are compact in the API, so it helps to write their meaning before using them.
sensitivity_vocabulary = pd.DataFrame(
    [
        {
            "term": "cf_y",
            "meaning": "Share of residual outcome variation that a hidden confounder could explain.",
            "practical reading": "Larger values mean a hidden variable predicts the outcome residual more strongly.",
        },
        {
            "term": "cf_d",
            "meaning": "Share-like strength for how much hidden confounding can affect the treatment-side score representation.",
            "practical reading": "Larger values mean a hidden variable is more connected to treatment assignment after observed controls.",
        },
        {
            "term": "rho",
            "meaning": "Alignment between hidden outcome-side and treatment-side components.",
            "practical reading": "rho = 1 is adversarial for a positive effect; lower absolute values are less extreme.",
        },
        {
            "term": "theta bounds",
            "meaning": "Bounds on the point estimate under a sensitivity scenario.",
            "practical reading": "How far the effect estimate could move under the specified hidden confounding strength.",
        },
        {
            "term": "CI bounds",
            "meaning": "Sensitivity bounds that also include statistical uncertainty.",
            "practical reading": "The conservative uncertainty range under the specified hidden confounding scenario.",
        },
        {
            "term": "RV",
            "meaning": "Robustness value for moving the estimate to the null hypothesis.",
            "practical reading": "How strong confounding needs to be to move the point estimate to the target null.",
        },
        {
            "term": "RVa",
            "meaning": "Robustness value adjusted for statistical uncertainty.",
            "practical reading": "How strong confounding needs to be to make the uncertainty-adjusted conclusion touch the target null.",
        },
    ]
)
save_table(sensitivity_vocabulary, f"{NOTEBOOK_PREFIX}_sensitivity_vocabulary.csv")
display(sensitivity_vocabulary)
|   | term | meaning | practical reading |
| --- | --- | --- | --- |
| 0 | cf_y | Share of residual outcome variation that a hidden confounder could explain. | Larger values mean a hidden variable predicts the outcome residual more strongly. |
| 1 | cf_d | Share-like strength for how much hidden confounding can affect the treatment-side score representation. | Larger values mean a hidden variable is more connected to treatment assignment after observed controls. |
| 2 | rho | Alignment between hidden outcome-side and treatment-side components. | rho = 1 is adversarial for a positive effect; lower absolute values are less extreme. |
| 3 | theta bounds | Bounds on the point estimate under a sensitivity scenario. | How far the effect estimate could move under the specified hidden confounding strength. |
| 4 | CI bounds | Sensitivity bounds that also include statistical uncertainty. | The conservative uncertainty range under the specified hidden confounding scenario. |
| 5 | RV | Robustness value for moving the estimate to the null hypothesis. | How strong confounding needs to be to move the point estimate to the target null. |
| 6 | RVa | Robustness value adjusted for statistical uncertainty. | How strong confounding needs to be to make the uncertainty-adjusted conclusion touch the target null. |
The main habit is to treat these values as stress-test parameters. They help structure a discussion about hidden confounding; they do not reveal whether the hidden confounder actually exists.
Synthetic Hidden-Confounding Design
The synthetic data contains a hidden variable called hidden_intent. It affects both the treatment and outcome. In a real dataset this variable would not be observed. Here it is kept in the dataframe so we can compare realistic models against an oracle model.
We also create two observed proxy controls:
proxy_light: weakly related to the hidden confounder,
proxy_strong: more strongly related to the hidden confounder.
These proxies are useful for benchmark calibration. If omitting an observed proxy creates a certain amount of confounding, we can ask whether an unobserved factor of comparable strength would change the conclusion.
The first rows include the hidden confounder because this is a controlled teaching example. In applied work, the hidden variable is exactly what we do not have.
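The data-generating cell itself is not shown in this export. The following minimal sketch, with illustrative coefficients of my own choosing, reproduces the qualitative design: a hidden confounder that raises both treatment and outcome, two noisy observed proxies, and the three control sets. It uses plain OLS via numpy rather than DoubleML so the bias mechanics are visible without any nuisance-model machinery.

```python
import numpy as np

rng = np.random.default_rng(100)
n, true_theta = 20_000, 0.80

# Hidden common cause; unobserved in a real analysis.
hidden_intent = rng.normal(size=n)
x1 = rng.normal(size=n)

# Observed proxies: noisy views of the hidden confounder (coefficients illustrative).
proxy_light = 0.3 * hidden_intent + rng.normal(scale=1.0, size=n)
proxy_strong = 0.7 * hidden_intent + rng.normal(scale=0.5, size=n)

# Hidden intent raises both treatment and outcome, so omitting it biases theta upward.
treatment = 0.5 * x1 + 0.6 * hidden_intent + rng.normal(size=n)
outcome = true_theta * treatment + 0.8 * x1 + 0.7 * hidden_intent + rng.normal(size=n)

def ols_theta(controls):
    """Coefficient on treatment from a linear regression with the given controls."""
    X = np.column_stack([treatment] + controls + [np.ones(n)])
    return np.linalg.lstsq(X, outcome, rcond=None)[0][0]

print(f"base controls only : {ols_theta([x1]):.3f}")                       # biased upward
print(f"+ proxies          : {ols_theta([x1, proxy_light, proxy_strong]):.3f}")
print(f"oracle (hidden)    : {ols_theta([x1, hidden_intent]):.3f}")        # near true_theta
```

The three estimates line up the way the notebook describes: base controls overstate the effect, proxies move the estimate partway back, and only the oracle recovers the truth.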
Data Dictionary And Audit
The audit records what each field means and shows that hidden_intent is related to treatment, outcome, and both proxy variables. This creates the conditions for hidden-confounding bias.
The proxies are correlated with the hidden confounder but do not perfectly reveal it. This is the useful middle ground for sensitivity teaching: adjustment helps, but hidden confounding can still remain.
Hidden Confounding Design Diagram
The diagram shows the logic of the synthetic design. Solid arrows are observed parts of the analysis. Dashed arrows mark the hidden path that sensitivity analysis is stress-testing.
The proxy controls block part of the hidden path because they carry information about hidden_intent. The remaining dashed path is the reason we still need a sensitivity stress test.
Fit Realistic And Oracle Models
We fit three PLR models:
observed base controls only,
observed base controls plus proxies,
oracle model with the hidden confounder.
The oracle model is not a realistic analysis. It is a teaching reference that shows what would happen if the missing confounder were actually observed.
The proxy-adjusted model moves toward the oracle estimate, but it does not fully remove hidden-confounding bias. This is exactly the setting where sensitivity analysis is useful: the main estimate is credible only if remaining hidden confounding is not too strong.
Model Comparison Plot
The plot below shows how estimates move as more information about the hidden confounder becomes available. The red dashed line is the true simulated treatment effect.
The realistic proxy-adjusted model is still above the true effect. Sensitivity analysis will ask how strong the remaining hidden confounding would need to be to move the estimate meaningfully.
Run A Primary Sensitivity Scenario
We use the proxy-adjusted model as the main realistic specification. The first sensitivity scenario uses cf_y = 0.04, cf_d = 0.04, and rho = 1.0. This is a modest but adversarial hidden-confounding scenario.
The null hypothesis is set to zero because a common reporting question is whether the result could be moved to no effect.
The bounds show how far the estimate and confidence interval could move under the chosen scenario. The robustness values are large for the zero null, meaning a fairly strong hidden confounder would be needed to move this positive estimate all the way to zero.
Sensitivity Bounds Plot
This plot separates three quantities that are easy to mix up: the original estimate, the sensitivity-adjusted point-estimate bounds, and the sensitivity-adjusted confidence bounds.
The sensitivity-adjusted confidence bound is wider than the point-estimate bound because it combines hidden-confounding stress with statistical uncertainty.
Sensitivity Grid
A single scenario can hide how the conclusion changes across parameter choices. This grid evaluates many combinations of cf_y and cf_d at rho = 1.0.
The grid is not a new model fit. It reuses the fitted DoubleML score objects and evaluates sensitivity formulas across possible hidden-confounding strengths.
cf_values = np.linspace(0.0, 0.12, 25)
grid_rows = []
for cf_y in cf_values:
    for cf_d in cf_values:
        main_model.sensitivity_analysis(
            cf_y=float(cf_y), cf_d=float(cf_d), rho=1.0, level=0.95, null_hypothesis=0.0
        )
        grid_rows.append(extract_sensitivity_row(main_model, scenario="grid", null_hypothesis=0.0))
sensitivity_grid = pd.DataFrame(grid_rows)
save_table(sensitivity_grid, f"{NOTEBOOK_PREFIX}_sensitivity_grid.csv")

# Restore the primary scenario after grid evaluation.
main_model.sensitivity_analysis(cf_y=0.04, cf_d=0.04, rho=1.0, level=0.95, null_hypothesis=0.0)
display(sensitivity_grid.head())
|   | scenario | cf_y | cf_d | rho | level | null_hypothesis | theta_hat | theta_lower | theta_upper | ci_lower | ci_upper | rv | rva | rv_percent | rva_percent | true_theta | theta_lower_below_true | ci_lower_below_true |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | grid | 0.0 | 0.000 | 1.0 | 0.95 | 0.0 | 0.892058 | 0.892058 | 0.892058 | 0.852738 | 0.931378 | 0.578486 | 0.559002 | 57.848598 | 55.900165 | 0.8 | False | False |
| 1 | grid | 0.0 | 0.005 | 1.0 | 0.95 | 0.0 | 0.892058 | 0.892058 | 0.892058 | 0.852738 | 0.931378 | 0.578486 | 0.559002 | 57.848598 | 55.900165 | 0.8 | False | False |
| 2 | grid | 0.0 | 0.010 | 1.0 | 0.95 | 0.0 | 0.892058 | 0.892058 | 0.892058 | 0.852738 | 0.931378 | 0.578486 | 0.559002 | 57.848598 | 55.900165 | 0.8 | False | False |
| 3 | grid | 0.0 | 0.015 | 1.0 | 0.95 | 0.0 | 0.892058 | 0.892058 | 0.892058 | 0.852738 | 0.931378 | 0.578486 | 0.559002 | 57.848598 | 55.900165 | 0.8 | False | False |
| 4 | grid | 0.0 | 0.020 | 1.0 | 0.95 | 0.0 | 0.892058 | 0.892058 | 0.892058 | 0.852738 | 0.931378 | 0.578486 | 0.559002 | 57.848598 | 55.900165 | 0.8 | False | False |
The grid table records lower and upper bounds for every pair of sensitivity parameters. The next figure turns those rows into a contour-style stress map.
Sensitivity Contour Plot
The contour plot shows the lower theta bound across the cf_y and cf_d grid. Benchmark points from observed proxy variables will be added in the next section; for now this plot shows the stress surface itself.
Moving up and right makes the hidden confounder stronger. The lower bound declines as the hidden confounder is allowed to explain more residual outcome and treatment-side variation.
Benchmarking With Observed Proxy Variables
The sensitivity_benchmark() method on a fitted DoubleML model calibrates sensitivity parameters using observed controls. The idea is to temporarily treat an observed variable as if it had been omitted, then measure how much that omission would matter.
This is useful because raw cf_y and cf_d values are hard to reason about. Benchmarks give them scale.
The stronger proxy creates larger benchmark values. The combined proxy benchmark gives a concrete reference for a hidden confounder with strength comparable to the observed proxy information.
Benchmark-Calibrated Scenario
Now we turn the combined proxy benchmark into a sensitivity scenario. This asks: if the remaining hidden confounder were as strong as the observed proxy information, how far could the estimate move?
The benchmark-calibrated scenario is stronger than the primary scenario. It moves the lower bound closer to the true simulated effect, showing how benchmark calibration can make sensitivity analysis more concrete.
Sensitivity Grid With Benchmarks
The same contour plot becomes more useful after adding benchmark points. The benchmark points show where observed proxy omissions sit on the hidden-confounding grid.
The benchmark points make the grid less abstract. A reader can compare the assumed hidden-confounding scenario to the impact of omitting observed proxy variables.
Native DoubleML Sensitivity Plot
DoubleML also provides an interactive Plotly sensitivity plot. We save it as HTML so it can be opened separately without requiring static image export support.
Native DoubleML sensitivity plot written to /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/13_native_doubleml_sensitivity_plot.html
The native plot is useful for exploration, while the static Matplotlib figures above are easier to include in reports and version-controlled artifacts.
Role Of Rho
The rho parameter controls the alignment between hidden outcome-side and treatment-side components. For a positive estimated effect, rho = 1 is the adversarial direction that most weakens the effect. Smaller absolute values imply less aligned hidden confounding.
The lower bound is most conservative when rho is positive and large. Reporting the rho choice helps readers understand whether the sensitivity scenario is adversarial or mild.
Rho Sensitivity Plot
This plot shows how the lower and upper theta bounds change as rho varies while cf_y and cf_d stay fixed.
The bounds widen in the adversarial direction as alignment increases. This is why rho = 1 is a common conservative default for a positive effect.
Robustness Values For Different Null Hypotheses
Robustness values depend on the null hypothesis. A null of zero asks how strong hidden confounding must be to erase the effect. In this simulation, we can also ask how strong hidden confounding must be to move the estimate to the known true effect. That second question is simulation-only, because real analyses do not know the true effect.
The zero null requires much stronger confounding than a null close to the observed estimate. This is why robustness values must always be reported with the null hypothesis they refer to.
Robustness Value Plot
The plot below shows how RV and RVa change across null hypotheses. RVa is smaller because it also accounts for statistical uncertainty.
The curve falls as the null hypothesis gets closer to the observed estimate. This is a useful safeguard against vague language such as “the result is robust” without saying robust to what.
Practical Reporting Language
The table below translates common sensitivity findings into careful reporting language. The goal is to avoid overstating what the analysis can prove.
reporting_language = pd.DataFrame(
    [
        {
            "finding": "Point estimate remains positive under modest sensitivity scenario",
            "careful wording": "Under the specified hidden-confounding scenario, the sensitivity-adjusted lower bound remains positive.",
            "avoid saying": "Hidden confounding is impossible.",
        },
        {
            "finding": "Benchmark-calibrated scenario moves the estimate materially",
            "careful wording": "A hidden factor comparable to the benchmarked observed proxies could materially reduce the estimate.",
            "avoid saying": "The benchmark proves the true bias size.",
        },
        {
            "finding": "Robustness value to zero is high",
            "careful wording": "Moving the point estimate to zero would require a hidden confounder of the reported strength under the model assumptions.",
            "avoid saying": "The result is automatically causal.",
        },
        {
            "finding": "RVa is much lower than RV",
            "careful wording": "Accounting for sampling uncertainty makes the conclusion more sensitive than the point estimate alone suggests.",
            "avoid saying": "The point-estimate robustness value is enough by itself.",
        },
    ]
)
save_table(reporting_language, f"{NOTEBOOK_PREFIX}_reporting_language.csv")
display(reporting_language)
|   | finding | careful wording | avoid saying |
| --- | --- | --- | --- |
| 0 | Point estimate remains positive under modest sensitivity scenario | Under the specified hidden-confounding scenario, the sensitivity-adjusted lower bound remains positive. | Hidden confounding is impossible. |
| 1 | Benchmark-calibrated scenario moves the estimate materially | A hidden factor comparable to the benchmarked observed proxies could materially reduce the estimate. | The benchmark proves the true bias size. |
| 2 | Robustness value to zero is high | Moving the point estimate to zero would require a hidden confounder of the reported strength under the model assumptions. | The result is automatically causal. |
| 3 | RVa is much lower than RV | Accounting for sampling uncertainty makes the conclusion more sensitive than the point estimate alone suggests. | The point-estimate robustness value is enough by itself. |
Careful wording keeps sensitivity analysis in its proper role: a stress test for assumptions, not a replacement for design validation.
Validation Plan
Sensitivity analysis should lead to better data questions. The table below lists concrete follow-up actions an analyst could take after seeing the sensitivity results.
validation_plan = pd.DataFrame(
    [
        {
            "risk": "Important unobserved user intent remains after proxy adjustment",
            "diagnostic_or_action": "Search for richer pre-treatment proxies or historical behavior features.",
            "expected benefit": "Reduce plausible cf_y and cf_d by measuring the hidden source more directly.",
        },
        {
            "risk": "Treatment timing may follow outcome anticipation",
            "diagnostic_or_action": "Audit feature timestamps and remove post-treatment variables from controls.",
            "expected benefit": "Protect the design from bad-control bias and reverse timing.",
        },
        {
            "risk": "Benchmark variables are weak comparisons for the hidden confounder",
            "diagnostic_or_action": "Benchmark several observed covariate groups separately and jointly.",
            "expected benefit": "Give readers multiple scales for plausible hidden-confounder strength.",
        },
        {
            "risk": "Sensitivity bounds are close to the decision threshold",
            "diagnostic_or_action": "Use a stronger design where possible: experiment, instrument, panel design, or negative control.",
            "expected benefit": "Shift credibility from modeling assumptions toward design-based evidence.",
        },
    ]
)
save_table(validation_plan, f"{NOTEBOOK_PREFIX}_validation_plan.csv")
display(validation_plan)
|   | risk | diagnostic_or_action | expected benefit |
| --- | --- | --- | --- |
| 0 | Important unobserved user intent remains after proxy adjustment | Search for richer pre-treatment proxies or historical behavior features. | Reduce plausible cf_y and cf_d by measuring the hidden source more directly. |
| 1 | Treatment timing may follow outcome anticipation | Audit feature timestamps and remove post-treatment variables from controls. | Protect the design from bad-control bias and reverse timing. |
| 2 | Benchmark variables are weak comparisons for the hidden confounder | Benchmark several observed covariate groups separately and jointly. | Give readers multiple scales for plausible hidden-confounder strength. |
| 3 | Sensitivity bounds are close to the decision threshold | Use a stronger design where possible: experiment, instrument, panel design, or negative control. | Shift credibility from modeling assumptions toward design-based evidence. |
A good sensitivity section does not end with a number. It should explain what data or design improvement would reduce the remaining uncertainty.
Report Template And Artifact Manifest
The final cell writes a reusable report template and an artifact manifest. The template emphasizes the sensitivity parameters, benchmark calibration, and design caveats.
report_text = f"""# Sensitivity Analysis Report Template

## Causal Design
- Outcome:
- Treatment:
- Observed controls:
- Main identification assumption:
- Main omitted-confounding concern:

## Main Estimate
- Point estimate:
- Standard error:
- Confidence interval:
- Primary DoubleML model:

## Sensitivity Scenario
- cf_y:
- cf_d:
- rho:
- Null hypothesis:
- Theta lower and upper bounds:
- CI lower and upper bounds:
- RV and RVa:

## Benchmark Calibration
- Benchmarking set(s):
- Benchmark cf_y and cf_d:
- Benchmark-calibrated bounds:
- Why these benchmarks are plausible or limited:

## Conclusion
- What remains stable under the tested scenario:
- What changes under stronger hidden confounding:
- What additional data or design would reduce concern:
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_report_template.md"
report_path.write_text(report_text)

artifact_manifest = pd.DataFrame(
    [
        {"artifact": "synthetic hidden-confounding data", "path": str(DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_hidden_confounding_data.csv")},
        {"artifact": "model comparison", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_model_comparison.csv")},
        {"artifact": "primary sensitivity", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_primary_sensitivity.csv")},
        {"artifact": "sensitivity grid", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_grid.csv")},
        {"artifact": "sensitivity benchmarks", "path": str(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_benchmarks.csv")},
        {"artifact": "native DoubleML sensitivity plot", "path": str(native_plot_path)},
        {"artifact": "report template", "path": str(report_path)},
        {"artifact": "hidden confounding design figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_design.png")},
        {"artifact": "benchmark grid figure", "path": str(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sensitivity_grid_with_benchmarks.png")},
    ]
)
save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(Markdown(f"Report template written to `{report_path}`"))
display(artifact_manifest)
Report template written to /home/apex/Documents/ranking_sys/notebooks/tutorials/doubleml/outputs/reports/13_sensitivity_report_template.md
|   | artifact | path |
| --- | --- | --- |
| 0 | synthetic hidden-confounding data | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 1 | model comparison | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 2 | primary sensitivity | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 3 | sensitivity grid | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 4 | sensitivity benchmarks | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 5 | native DoubleML sensitivity plot | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 6 | report template | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 7 | hidden confounding design figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
| 8 | benchmark grid figure | /home/apex/Documents/ranking_sys/notebooks/tut... |
The notebook now has a full DoubleML sensitivity workflow: omitted-confounding setup, realistic and oracle models, sensitivity bounds, observed-variable benchmarks, rho stress testing, robustness values, and reporting guidance.
What Comes Next
The next natural topic is heterogeneous treatment effects: GATE, CATE-style summaries, best linear predictors, and how to talk about heterogeneity without confusing subgroup exploration with a new identification strategy.