16. AI for Sensitivity Analysis

Sensitivity analysis asks a disciplined question:

How wrong would our causal conclusion be under plausible violations of the assumptions?

This is different from asking whether a coefficient is statistically significant. An observational estimate can be precisely estimated and still be badly biased. A sensitivity analysis makes design fragility visible: unobserved confounding, omitted variables, weak overlap, measurement error, placebo failures, and bad-control risks.

This notebook shows how AI can help structure sensitivity analysis without replacing it. Python computes the evidence. The LLM helps organize interpretation and communication. The audit layer checks whether the model overclaims robustness.

Learning Goals

By the end of this notebook you should be able to:

Explain why sensitivity analysis is part of the causal result, not an optional appendix.
Estimate a baseline observational effect and compare it with an oracle estimate in synthetic data.
Use observed covariates as benchmarks for omitted-variable sensitivity.
Build a residual-confounding sensitivity grid that asks how strong an unobserved confounder would need to be.
Use placebo outcomes to detect design fragility.
Assemble a machine-readable sensitivity report.
Use local LLMs to summarize sensitivity evidence while auditing them for overconfident robustness claims.

Live Model Note

This course treats LLM behavior as an empirical object. These notebooks may include live local-model calls, so outputs can vary across model versions, hardware, decoding settings, prompt wording, package versions, and reruns. That instability is part of the lesson: AI-assisted causal work needs deterministic checks, structured outputs, model comparison, repair logic, and human review.

Sensitivity-analysis notebooks are brittle because the interpretation is subtle. A model may correctly list sensitivity checks but still say “robust” too quickly. It may ignore a placebo failure, confuse oracle synthetic columns with real evidence, or treat a benchmark calculation as proof. We will make those risks explicit and score the model’s interpretation against deterministic gates.

1. Setup

The deterministic part of this notebook uses a synthetic observational example with a known hidden confounder. In real data we would not observe the hidden confounder; here we keep it for teaching and oracle comparisons.

import importlib.util
import json
import sys
import textwrap
import warnings
from pathlib import Path
from typing import Literal

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import torch
from IPython.display import Markdown, display
from pydantic import BaseModel, Field
from sklearn.linear_model import LinearRegression

warnings.filterwarnings('ignore', category=FutureWarning)

RUN_LIVE_LOCAL_LLM = True
RUN_FULL_MODEL_COMPARISON = True
RUN_SCHEMA_REPAIR_RETRY = True

LOCAL_SMOKE_TEST_MODEL = 'Qwen/Qwen2.5-0.5B-Instruct'
LOCAL_FAST_MODEL = 'Qwen/Qwen2.5-7B-Instruct'
LOCAL_STRONG_MODEL = 'Qwen/Qwen2.5-14B-Instruct'
LOCAL_SCALE_MODEL = 'Qwen/Qwen2.5-32B-Instruct'

LOCAL_ALT_REASONING_MODEL = 'microsoft/Phi-3.5-mini-instruct'
LOCAL_ALT_OPEN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.3'
LOCAL_MISTRAL_SMALL_MODEL = 'mistralai/Mistral-Small-3.1-24B-Instruct-2503'
LOCAL_GEMMA_MODEL = 'google/gemma-3-27b-it'
LOCAL_LLAMA_MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

MODEL_ID = LOCAL_STRONG_MODEL
MAX_NEW_TOKENS = 1900
TEMPERATURE = 0.0
SEED = 216
MODEL_COMPARISON_CASE_LIMIT = 2

MODELS_TO_COMPARE = [
    ('Qwen 0.5B', LOCAL_SMOKE_TEST_MODEL, 'pipeline smoke test'),
    ('Qwen 7B', LOCAL_FAST_MODEL, 'fast default'),
    ('Qwen 14B', LOCAL_STRONG_MODEL, 'strong local analysis'),
    ('Qwen 32B', LOCAL_SCALE_MODEL, 'scale comparison'),
    ('Phi mini', LOCAL_ALT_REASONING_MODEL, 'compact non-Qwen comparison'),
    ('Mistral 7B', LOCAL_ALT_OPEN_MODEL, '7B model-family comparison'),
    ('Mistral Small 24B', LOCAL_MISTRAL_SMALL_MODEL, 'strong non-Qwen comparison'),
    ('Gemma 3 27B', LOCAL_GEMMA_MODEL, 'large non-Qwen comparison'),
    ('Llama 3.1 8B', LOCAL_LLAMA_MODEL, 'industry-standard instruct baseline'),
]

np.random.seed(SEED)
sns.set_theme(style='whitegrid', context='notebook')
pd.set_option('display.max_colwidth', 160)

def has_package(module_name):
    return importlib.util.find_spec(module_name) is not None

package_status = pd.DataFrame(
    [
        {'package': 'pandas', 'available': has_package('pandas'), 'used_for': 'sensitivity tables'},
        {'package': 'statsmodels', 'available': has_package('statsmodels'), 'used_for': 'effect and placebo regressions'},
        {'package': 'sklearn', 'available': has_package('sklearn'), 'used_for': 'residualization helpers'},
        {'package': 'seaborn', 'available': has_package('seaborn'), 'used_for': 'sensitivity plots'},
        {'package': 'pydantic', 'available': has_package('pydantic'), 'used_for': 'structured AI sensitivity review'},
        {'package': 'transformers', 'available': has_package('transformers'), 'used_for': 'optional local LLM review'},
        {'package': 'torch', 'available': has_package('torch'), 'used_for': 'GPU inference if live LLMs are enabled'},
    ]
)

print(f'CUDA available to this kernel: {torch.cuda.is_available()}')
package_status

CUDA available to this kernel: True

	package	available	used_for
0	pandas	True	sensitivity tables
1	statsmodels	True	effect and placebo regressions
2	sklearn	True	residualization helpers
3	seaborn	True	sensitivity plots
4	pydantic	True	structured AI sensitivity review
5	transformers	True	optional local LLM review
6	torch	True	GPU inference if live LLMs are enabled

2. What Sensitivity Analysis Is For

Sensitivity analysis is not pessimism. It is disciplined honesty.

A standard adjusted estimate says:

Under the stated assumptions, this is the estimated effect.

A sensitivity analysis adds:

If the assumptions are wrong in these specific ways, this is how the estimate would move.

The second sentence is often the difference between a defensible causal report and a brittle dashboard claim.

sensitivity_components = pd.DataFrame(
    [
        {
            'component': 'Benchmark confounders',
            'question': 'How much does the treatment effect move when important observed covariates are omitted?',
            'what it teaches': 'Plausible scale for omitted-variable bias.',
        },
        {
            'component': 'Residual confounding grid',
            'question': 'How strong would a hidden confounder need to be with residual treatment and outcome?',
            'what it teaches': 'Tipping points for sign reversal or loss of practical significance.',
        },
        {
            'component': 'Placebo outcome',
            'question': 'Does treatment appear to affect an outcome measured before treatment?',
            'what it teaches': 'Evidence of residual confounding or selection.',
        },
        {
            'component': 'Oracle comparison in simulation',
            'question': 'How far is feasible adjustment from the estimate that includes hidden truth variables?',
            'what it teaches': 'What the observed analysis is missing in a controlled lab.',
        },
        {
            'component': 'Communication warning',
            'question': 'What should stakeholders not conclude from this estimate?',
            'what it teaches': 'Prevents robustness language from becoming overclaiming.',
        },
    ]
)
sensitivity_components

	component	question	what it teaches
0	Benchmark confounders	How much does the treatment effect move when important observed covariates are omitted?	Plausible scale for omitted-variable bias.
1	Residual confounding grid	How strong would a hidden confounder need to be with residual treatment and outcome?	Tipping points for sign reversal or loss of practical significance.
2	Placebo outcome	Does treatment appear to affect an outcome measured before treatment?	Evidence of residual confounding or selection.
3	Oracle comparison in simulation	How far is feasible adjustment from the estimate that includes hidden truth variables?	What the observed analysis is missing in a controlled lab.
4	Communication warning	What should stakeholders not conclude from this estimate?	Prevents robustness language from becoming overclaiming.

Discussion

No single sensitivity method is universal. The right set depends on the design: unmeasured confounding for observational adjustment, pre-trends for DiD, exclusion violations for IV, manipulation for RDD, interference for experiments, and logging-policy overlap for off-policy evaluation.

3. Running Example: Onboarding Specialist Assignment

A SaaS company assigns some customers to a specialist onboarding program. The business asks:

What is the effect of specialist onboarding on 90-day net revenue?

Assignment is observational. Managers prioritize accounts with high value, high risk, and signs of executive attention. Some of that attention is recorded; some is not. We simulate the hidden variable so we can teach sensitivity analysis.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def simulate_onboarding_data(n=5200, seed=SEED, hidden_strength=0.65, overlap_noise=0.75):
    rng = np.random.default_rng(seed)
    segment = rng.choice(['startup', 'midmarket', 'enterprise'], size=n, p=[0.46, 0.38, 0.16])
    region = rng.choice(['Americas', 'EMEA', 'APAC'], size=n, p=[0.48, 0.33, 0.19])
    tenure_months = rng.gamma(shape=2.5, scale=7.5, size=n).clip(1, 72)
    baseline_usage = rng.normal(0, 1, size=n)
    prior_support_tickets = rng.poisson(np.exp(-0.20 - 0.18 * baseline_usage + 0.24 * (segment == 'enterprise')), size=n)
    contract_value = np.exp(3.3 + 0.45 * (segment == 'enterprise') + 0.18 * baseline_usage + rng.normal(0, 0.42, size=n))
    observed_health = 0.62 * baseline_usage + 0.018 * tenure_months - 0.24 * prior_support_tickets + 0.18 * (region == 'Americas') + rng.normal(0, 0.58, size=n)
    risk_score = -0.86 * observed_health + 0.30 * prior_support_tickets + 0.12 * np.log1p(contract_value) + rng.normal(0, 0.45, size=n)

    executive_attention = rng.normal(0, 1, size=n)
    prior_period_revenue = (
        24
        + 0.34 * contract_value
        + 7.0 * observed_health
        + hidden_strength * 7.5 * executive_attention
        + rng.normal(0, 18, size=n)
    )

    assignment_logit = (
        -0.25
        + 1.05 * risk_score
        + 0.24 * np.log1p(contract_value)
        + 0.18 * (segment == 'enterprise')
        + hidden_strength * 0.85 * executive_attention
        + rng.normal(0, overlap_noise, size=n)
    )
    onboarding_specialist = rng.binomial(1, sigmoid(assignment_logit))

    true_effect = 7.0 + 2.0 * sigmoid(risk_score) + 1.2 * (segment == 'enterprise')
    net_revenue_90d = (
        30
        + 0.42 * contract_value
        + 10.0 * observed_health
        - 2.0 * prior_support_tickets
        + hidden_strength * 9.0 * executive_attention
        + true_effect * onboarding_specialist
        + rng.normal(0, 22, size=n)
    )
    product_adoption_30d = 0.45 * baseline_usage + 0.55 * onboarding_specialist + 0.30 * executive_attention + rng.normal(0, 1, size=n)
    expansion_pipeline_120d = net_revenue_90d + 0.20 * contract_value + 12 * onboarding_specialist + rng.normal(0, 20, size=n)

    df = pd.DataFrame(
        {
            'account_id': [f'A{idx:05d}' for idx in range(n)],
            'segment': segment,
            'region': region,
            'tenure_months': tenure_months,
            'baseline_usage': baseline_usage,
            'prior_support_tickets': prior_support_tickets,
            'contract_value': contract_value,
            'observed_health': observed_health,
            'risk_score': risk_score,
            'executive_attention': executive_attention,
            'prior_period_revenue': prior_period_revenue,
            'onboarding_specialist': onboarding_specialist,
            'true_effect': true_effect,
            'net_revenue_90d': net_revenue_90d,
            'product_adoption_30d': product_adoption_30d,
            'expansion_pipeline_120d': expansion_pipeline_120d,
        }
    )
    truth = {
        'true_ate': float(df['true_effect'].mean()),
        'true_att': float(df.loc[df['onboarding_specialist'].eq(1), 'true_effect'].mean()),
        'treatment_rate': float(df['onboarding_specialist'].mean()),
        'hidden_strength': hidden_strength,
    }
    return df, truth

raw, truth = simulate_onboarding_data()
raw.head()

	account_id	segment	region	tenure_months	baseline_usage	prior_support_tickets	contract_value	observed_health	risk_score	executive_attention	prior_period_revenue	onboarding_specialist	true_effect	net_revenue_90d	product_adoption_30d	expansion_pipeline_120d
0	A00000	enterprise	Americas	19.767521	1.440981	1	42.053951	0.372740	0.646366	-0.881831	35.639973	0	9.512382	73.822233	1.491241	125.168395
1	A00001	midmarket	Americas	2.714544	-0.799302	0	37.342631	0.143256	0.532391	0.166796	24.695390	1	8.260081	96.890555	-0.940938	108.780270
2	A00002	startup	EMEA	54.283982	-1.945033	1	10.483906	0.664454	-0.882747	-1.052783	1.203831	1	7.585218	38.893951	-0.752022	43.501021
3	A00003	midmarket	Americas	20.707722	-0.813210	3	12.673929	-0.047837	2.111590	0.039967	43.765324	1	8.784049	30.239183	-1.263760	37.587590
4	A00004	midmarket	EMEA	17.547371	-1.420773	1	38.000833	-0.279677	0.550063	-1.428694	14.994348	0	8.268300	78.246247	-0.659701	56.833462

truth

{'true_ate': 8.388275446972925,
 'true_att': 8.50067894836023,
 'treatment_rate': 0.685,
 'hidden_strength': 0.65}

4. Analysis Contract and Variable Roles

The sensitivity report is tied to a specific estimand and variable role map. The hidden variable is included only because this is a synthetic teaching notebook.

analysis_contract = {
    'analysis_id': 'onboarding_specialist_sensitivity_v1',
    'estimand': 'ATE of onboarding_specialist on net_revenue_90d among eligible accounts',
    'treatment': 'onboarding_specialist',
    'outcome': 'net_revenue_90d',
    'placebo_outcome': 'prior_period_revenue',
    'allowed_covariates': [
        'segment',
        'region',
        'tenure_months',
        'baseline_usage',
        'prior_support_tickets',
        'contract_value',
        'observed_health',
        'risk_score',
    ],
    'categorical_covariates': ['segment', 'region'],
    'oracle_hidden_confounders': ['executive_attention'],
    'forbidden_variables': ['product_adoption_30d', 'expansion_pipeline_120d', 'true_effect'],
    'diagnostic_thresholds': {
        'placebo_caution_revenue_units': 4.0,
        'tipping_point_distance_caution': 0.35,
        'benchmark_shift_caution_ratio': 1.0,
    },
}

variable_roles = pd.DataFrame(
    [{'variable': var, 'role': 'allowed pre-treatment covariate'} for var in analysis_contract['allowed_covariates']]
    + [{'variable': analysis_contract['treatment'], 'role': 'treatment'}]
    + [{'variable': analysis_contract['outcome'], 'role': 'outcome'}]
    + [{'variable': analysis_contract['placebo_outcome'], 'role': 'pre-treatment placebo outcome'}]
    + [{'variable': var, 'role': 'oracle hidden confounder for teaching only'} for var in analysis_contract['oracle_hidden_confounders']]
    + [{'variable': var, 'role': 'forbidden post-treatment/future/oracle variable'} for var in analysis_contract['forbidden_variables']]
)
variable_roles

	variable	role
0	segment	allowed pre-treatment covariate
1	region	allowed pre-treatment covariate
2	tenure_months	allowed pre-treatment covariate
3	baseline_usage	allowed pre-treatment covariate
4	prior_support_tickets	allowed pre-treatment covariate
5	contract_value	allowed pre-treatment covariate
6	observed_health	allowed pre-treatment covariate
7	risk_score	allowed pre-treatment covariate
8	onboarding_specialist	treatment
9	net_revenue_90d	outcome
10	prior_period_revenue	pre-treatment placebo outcome
11	executive_attention	oracle hidden confounder for teaching only
12	product_adoption_30d	forbidden post-treatment/future/oracle variable
13	expansion_pipeline_120d	forbidden post-treatment/future/oracle variable
14	true_effect	forbidden post-treatment/future/oracle variable

Discussion

The forbidden variables are important. product_adoption_30d is post-treatment. expansion_pipeline_120d is future-looking. true_effect is simulation truth. Sensitivity analysis should not quietly turn these into controls.

5. Baseline, Oracle, and Placebo Estimates

We estimate three models:

A feasible adjusted model using observed pre-treatment covariates.
An oracle model that additionally controls for the hidden confounder. This is for teaching only.
A placebo model using the same adjustment set but with a pre-treatment outcome.

def formula_term(variable, categorical):
    return f'C({variable})' if variable in categorical else variable


def build_formula(outcome, treatment, covariates, categorical):
    terms = [formula_term(var, categorical) for var in covariates]
    return f"{outcome} ~ {treatment} + " + ' + '.join(terms)


def fit_effect_model(df, outcome, treatment, covariates, categorical):
    formula = build_formula(outcome, treatment, covariates, categorical)
    fitted = smf.ols(formula, data=df).fit(cov_type='HC3')
    estimate = fitted.params[treatment]
    se = fitted.bse[treatment]
    return {
        'formula': formula,
        'estimate': float(estimate),
        'robust_se': float(se),
        'ci_low': float(estimate - 1.96 * se),
        'ci_high': float(estimate + 1.96 * se),
        'p_value': float(fitted.pvalues[treatment]),
        'rows': int(fitted.nobs),
        'model': fitted,
    }

observed_result = fit_effect_model(
    raw,
    analysis_contract['outcome'],
    analysis_contract['treatment'],
    analysis_contract['allowed_covariates'],
    analysis_contract['categorical_covariates'],
)
oracle_result = fit_effect_model(
    raw,
    analysis_contract['outcome'],
    analysis_contract['treatment'],
    analysis_contract['allowed_covariates'] + analysis_contract['oracle_hidden_confounders'],
    analysis_contract['categorical_covariates'],
)
placebo_result = fit_effect_model(
    raw,
    analysis_contract['placebo_outcome'],
    analysis_contract['treatment'],
    analysis_contract['allowed_covariates'],
    analysis_contract['categorical_covariates'],
)

estimate_table = pd.DataFrame(
    [
        {'model': 'observed adjusted', 'target': 'outcome', **{k: v for k, v in observed_result.items() if k != 'model'}},
        {'model': 'oracle hidden-confounder adjusted', 'target': 'outcome', **{k: v for k, v in oracle_result.items() if k != 'model'}},
        {'model': 'observed adjusted placebo', 'target': 'placebo', **{k: v for k, v in placebo_result.items() if k != 'model'}},
    ]
)
estimate_table[['model', 'target', 'estimate', 'robust_se', 'ci_low', 'ci_high', 'p_value', 'rows']]

	model	target	estimate	robust_se	ci_low	ci_high	p_value	rows
0	observed adjusted	outcome	9.757901	0.729626	8.327834	11.187967	8.597864e-41	5200
1	oracle hidden-confounder adjusted	outcome	7.336248	0.717084	5.930764	8.741732	1.445147e-24	5200
2	observed adjusted placebo	placebo	2.194321	0.604380	1.009736	3.378906	2.826560e-04	5200

fig, ax = plt.subplots(figsize=(9, 3.8))
plot_estimates = estimate_table.query("target == 'outcome'").copy()
ax.errorbar(
    plot_estimates['estimate'],
    plot_estimates['model'],
    xerr=[plot_estimates['estimate'] - plot_estimates['ci_low'], plot_estimates['ci_high'] - plot_estimates['estimate']],
    fmt='o',
    color='#4477AA',
    ecolor='#777777',
    capsize=4,
)
ax.axvline(truth['true_ate'], color='#CC3311', linestyle='--', linewidth=2, label='true ATE')
ax.axvline(0, color='black', linewidth=1)
ax.set_title('Observed versus oracle sensitivity check')
ax.set_xlabel('Estimated effect on 90-day net revenue')
ax.set_ylabel('')
ax.legend()
plt.tight_layout()
plt.show()

Discussion

The oracle comparison is a teaching device. In a real project we do not have executive_attention. If the oracle estimate differs from the observed-adjusted estimate, it shows why hidden confounding matters.

The placebo estimate is also important. Current treatment cannot cause prior-period revenue. A nonzero placebo effect is a warning about residual confounding or selection.

6. Benchmark Confounder Analysis

One practical sensitivity habit is to ask:

How much does the treatment coefficient move when important observed covariates are removed?

Those movements do not prove how large unobserved confounding is, but they give a scale for reasoning about omitted variables.

def refit_without_covariate(df, contract, omitted):
    covariates = [var for var in contract['allowed_covariates'] if var != omitted]
    categorical = [var for var in contract['categorical_covariates'] if var in covariates]
    result = fit_effect_model(df, contract['outcome'], contract['treatment'], covariates, categorical)
    return result['estimate']

benchmark_rows = []
for covariate in analysis_contract['allowed_covariates']:
    reduced_estimate = refit_without_covariate(raw, analysis_contract, covariate)
    shift = reduced_estimate - observed_result['estimate']
    benchmark_rows.append(
        {
            'omitted_covariate': covariate,
            'estimate_without_covariate': reduced_estimate,
            'shift_from_full_model': shift,
            'abs_shift': abs(shift),
            'shift_as_share_of_estimate': abs(shift) / abs(observed_result['estimate']),
        }
    )
benchmark_table = pd.DataFrame(benchmark_rows).sort_values('abs_shift', ascending=False)
benchmark_table

	omitted_covariate	estimate_without_covariate	shift_from_full_model	abs_shift	shift_as_share_of_estimate
7	risk_score	10.041754	0.283853	0.283853	0.029090
5	contract_value	10.013070	0.255169	0.255169	0.026150
6	observed_health	9.734774	-0.023127	0.023127	0.002370
0	segment	9.776957	0.019056	0.019056	0.001953
4	prior_support_tickets	9.741525	-0.016376	0.016376	0.001678
1	region	9.768545	0.010644	0.010644	0.001091
2	tenure_months	9.765030	0.007129	0.007129	0.000731
3	baseline_usage	9.752294	-0.005607	0.005607	0.000575

fig, ax = plt.subplots(figsize=(8, 4.8))
sns.barplot(data=benchmark_table, y='omitted_covariate', x='shift_from_full_model', color='#AA3377', ax=ax)
ax.axvline(0, color='black', linewidth=1)
ax.set_title('Coefficient movement when observed covariates are omitted')
ax.set_xlabel('Shift in estimated treatment effect')
ax.set_ylabel('')
plt.tight_layout()
plt.show()

Discussion

If omitting one observed covariate moves the estimate substantially, then an unobserved covariate of comparable importance could also matter. Benchmarking makes that conversation concrete.

7. Residual-Confounding Sensitivity Grid

We now ask a hypothetical question:

Suppose there is one unobserved confounder left after adjusting for observed covariates. If that confounder has correlation r_tu with the residualized treatment and r_yu with the residualized outcome, how would the treatment coefficient change?

This is a stylized sensitivity analysis. It is not a substitute for design knowledge, but it makes the tipping point visible.

def design_matrix_for_contract(df, covariates, categorical):
    X = pd.get_dummies(df[covariates], columns=categorical, drop_first=True, dtype=float)
    X = pd.concat([pd.Series(1.0, index=X.index, name='intercept'), X], axis=1)
    return X


def residualize(series, X):
    y = np.asarray(series, dtype=float)
    beta = np.linalg.lstsq(np.asarray(X, dtype=float), y, rcond=None)[0]
    return y - np.asarray(X, dtype=float) @ beta


def coefficient_with_unobserved_confounder(resid_t, resid_y, r_tu, r_yu):
    resid_t = np.asarray(resid_t, dtype=float)
    resid_y = np.asarray(resid_y, dtype=float)
    var_t = np.var(resid_t, ddof=0)
    var_y = np.var(resid_y, ddof=0)
    cov_ty = np.mean((resid_t - resid_t.mean()) * (resid_y - resid_y.mean()))
    sd_t = np.sqrt(var_t)
    sd_y = np.sqrt(var_y)
    cov_tu = r_tu * sd_t
    cov_yu = r_yu * sd_y
    denom = var_t - cov_tu ** 2
    if denom <= 1e-12:
        return np.nan
    return (cov_ty - cov_tu * cov_yu) / denom

X_observed = design_matrix_for_contract(raw, analysis_contract['allowed_covariates'], analysis_contract['categorical_covariates'])
resid_t = residualize(raw[analysis_contract['treatment']], X_observed)
resid_y = residualize(raw[analysis_contract['outcome']], X_observed)
resid_placebo = residualize(raw[analysis_contract['placebo_outcome']], X_observed)

r_values = np.round(np.linspace(-0.60, 0.60, 49), 3)
grid_rows = []
for r_tu in r_values:
    for r_yu in r_values:
        adjusted_coef = coefficient_with_unobserved_confounder(resid_t, resid_y, r_tu, r_yu)
        grid_rows.append({'r_tu': r_tu, 'r_yu': r_yu, 'sensitivity_adjusted_effect': adjusted_coef})
sensitivity_grid = pd.DataFrame(grid_rows)
sensitivity_grid.head()

	r_tu	r_yu	sensitivity_adjusted_effect
0	-0.6	-0.600	-14.525436
1	-0.6	-0.575	-13.284929
2	-0.6	-0.550	-12.044423
3	-0.6	-0.525	-10.803916
4	-0.6	-0.500	-9.563410

zero_crossing = sensitivity_grid.query('sensitivity_adjusted_effect <= 0').copy()
if zero_crossing.empty:
    tipping_point = {'exists_in_grid': False, 'min_correlation_distance': np.nan, 'r_tu': np.nan, 'r_yu': np.nan, 'effect_at_tipping_point': np.nan}
else:
    zero_crossing['correlation_distance'] = np.sqrt(zero_crossing['r_tu'] ** 2 + zero_crossing['r_yu'] ** 2)
    row = zero_crossing.sort_values('correlation_distance').iloc[0]
    tipping_point = {
        'exists_in_grid': True,
        'min_correlation_distance': float(row['correlation_distance']),
        'r_tu': float(row['r_tu']),
        'r_yu': float(row['r_yu']),
        'effect_at_tipping_point': float(row['sensitivity_adjusted_effect']),
    }

tipping_point

{'exists_in_grid': True,
 'min_correlation_distance': 0.6189709201569974,
 'r_tu': 0.425,
 'r_yu': 0.45,
 'effect_at_tipping_point': -0.44501253597670876}

heatmap_data = sensitivity_grid.pivot(index='r_yu', columns='r_tu', values='sensitivity_adjusted_effect')
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(heatmap_data, cmap='coolwarm', center=0, cbar_kws={'label': 'Adjusted effect'}, ax=ax)
ax.set_title('Residual-confounding sensitivity grid')
ax.set_xlabel('Correlation with residual treatment: r_tu')
ax.set_ylabel('Correlation with residual outcome: r_yu')
xticks = np.linspace(0, len(r_values) - 1, 7, dtype=int)
yticks = np.linspace(0, len(r_values) - 1, 7, dtype=int)
ax.set_xticks(xticks + 0.5)
ax.set_xticklabels([f'{r_values[i]:.2f}' for i in xticks], rotation=0)
ax.set_yticks(yticks + 0.5)
ax.set_yticklabels([f'{r_values[i]:.2f}' for i in yticks], rotation=0)
plt.tight_layout()
plt.show()

Discussion

The grid is not saying that a particular hidden confounder exists. It is saying how strong one would need to be, on the residual scale, to change the conclusion.

If the effect reverses under very small residual correlations, the design is fragile. If reversal requires implausibly strong residual correlations, the estimate is more robust to this specific form of hidden confounding. That still does not prove causality.

8. Placebo Outcome Sensitivity

A placebo outcome gives a different lens. Since treatment happens after prior-period revenue, any apparent effect on prior_period_revenue is evidence that treated and untreated accounts differ in ways the observed model has not fully removed.

placebo_summary = {
    'placebo_outcome': analysis_contract['placebo_outcome'],
    'placebo_estimate': placebo_result['estimate'],
    'placebo_robust_se': placebo_result['robust_se'],
    'placebo_ci_low': placebo_result['ci_low'],
    'placebo_ci_high': placebo_result['ci_high'],
    'absolute_placebo_estimate': abs(placebo_result['estimate']),
    'placebo_as_share_of_observed_effect': abs(placebo_result['estimate']) / abs(observed_result['estimate']),
}
placebo_summary

{'placebo_outcome': 'prior_period_revenue',
 'placebo_estimate': 2.1943210894276772,
 'placebo_robust_se': 0.6043799867184138,
 'placebo_ci_low': 1.0097363154595862,
 'placebo_ci_high': 3.3789058633957683,
 'absolute_placebo_estimate': 2.1943210894276772,
 'placebo_as_share_of_observed_effect': 0.2248763474696809}

fig, ax = plt.subplots(figsize=(8, 3.6))
placebo_plot = pd.DataFrame(
    [
        {'estimand': 'main outcome estimate', 'estimate': observed_result['estimate'], 'ci_low': observed_result['ci_low'], 'ci_high': observed_result['ci_high']},
        {'estimand': 'placebo prior outcome', 'estimate': placebo_result['estimate'], 'ci_low': placebo_result['ci_low'], 'ci_high': placebo_result['ci_high']},
    ]
)
ax.errorbar(
    placebo_plot['estimate'],
    placebo_plot['estimand'],
    xerr=[placebo_plot['estimate'] - placebo_plot['ci_low'], placebo_plot['ci_high'] - placebo_plot['estimate']],
    fmt='o',
    color='#228833',
    ecolor='#777777',
    capsize=4,
)
ax.axvline(0, color='black', linewidth=1)
ax.set_title('Main estimate versus placebo outcome estimate')
ax.set_xlabel('Estimated coefficient on onboarding_specialist')
ax.set_ylabel('')
plt.tight_layout()
plt.show()

Discussion

A placebo failure does not tell us the exact bias in the main outcome. It tells us the design can produce an apparent treatment effect where no treatment effect should exist. That should change how cautiously we communicate the main estimate.

9. Sensitivity Gates

We now turn the sensitivity evidence into a decision recommendation. The gates are deliberately conservative.

def classify_sensitivity_gates(observed_result, oracle_result, benchmark_table, placebo_summary, tipping_point, contract):
    thresholds = contract['diagnostic_thresholds']
    max_benchmark_shift = benchmark_table['abs_shift'].max()
    oracle_shift = abs(observed_result['estimate'] - oracle_result['estimate'])
    gates = [
        {
            'gate': 'oracle shift in synthetic lab',
            'metric': 'abs observed-minus-oracle estimate',
            'value': oracle_shift,
            'status': 'caution' if oracle_shift > max_benchmark_shift else 'pass',
            'interpretation': 'How much the estimate changes when hidden confounder is included. Teaching-only diagnostic.',
        },
        {
            'gate': 'largest observed-covariate benchmark',
            'metric': 'max single-covariate shift',
            'value': max_benchmark_shift,
            'status': 'caution' if max_benchmark_shift / abs(observed_result['estimate']) > thresholds['benchmark_shift_caution_ratio'] else 'pass',
            'interpretation': 'Scale of coefficient movement from omitting observed covariates.',
        },
        {
            'gate': 'placebo outcome',
            'metric': 'absolute placebo estimate',
            'value': placebo_summary['absolute_placebo_estimate'],
            'status': 'caution' if placebo_summary['absolute_placebo_estimate'] > thresholds['placebo_caution_revenue_units'] else 'pass',
            'interpretation': 'Apparent effect on pre-treatment outcome.',
        },
        {
            'gate': 'tipping point proximity',
            'metric': 'minimum residual-correlation distance to sign reversal',
            'value': tipping_point['min_correlation_distance'],
            'status': 'caution' if tipping_point['exists_in_grid'] and tipping_point['min_correlation_distance'] < thresholds['tipping_point_distance_caution'] else 'pass',
            'interpretation': 'How close the estimate is to sign reversal under residual confounding grid.',
        },
    ]
    gate_table = pd.DataFrame(gates)
    if (gate_table['status'] == 'caution').sum() >= 2:
        recommendation = 'collect_more_evidence_before_strong_causal_claim'
    elif (gate_table['status'] == 'caution').sum() == 1:
        recommendation = 'report_with_caution_and_sensitivity_limitations'
    else:
        recommendation = 'sensitivity_checks_support_cautious_reporting'
    return gate_table, recommendation

sensitivity_gate_table, sensitivity_recommendation = classify_sensitivity_gates(
    observed_result,
    oracle_result,
    benchmark_table,
    placebo_summary,
    tipping_point,
    analysis_contract,
)
sensitivity_gate_table

	gate	metric	value	status	interpretation
0	oracle shift in synthetic lab	abs observed-minus-oracle estimate	2.421653	caution	How much the estimate changes when hidden confounder is included. Teaching-only diagnostic.
1	largest observed-covariate benchmark	max single-covariate shift	0.283853	pass	Scale of coefficient movement from omitting observed covariates.
2	placebo outcome	absolute placebo estimate	2.194321	pass	Apparent effect on pre-treatment outcome.
3	tipping point proximity	minimum residual-correlation distance to sign reversal	0.618971	pass	How close the estimate is to sign reversal under residual confounding grid.

sensitivity_recommendation

'report_with_caution_and_sensitivity_limitations'

Discussion

The gate recommendation is not a universal rule. It is a reporting discipline. It forces the analyst to say when sensitivity evidence weakens the causal story.

10. Build a Sensitivity Report Bundle

We now assemble the sensitivity evidence into a machine-readable report for AI review.

def dataframe_records(table, max_rows=30):
    return json.loads(table.head(max_rows).to_json(orient='records', date_format='iso'))

sensitivity_report = {
    'analysis_contract': analysis_contract,
    'truth_available_only_because_synthetic': truth,
    'observed_adjusted_estimate': {k: v for k, v in observed_result.items() if k != 'model'},
    'oracle_hidden_confounder_estimate': {k: v for k, v in oracle_result.items() if k != 'model'},
    'placebo_estimate': {k: v for k, v in placebo_result.items() if k != 'model'},
    'benchmark_covariate_table': dataframe_records(benchmark_table, max_rows=12),
    'tipping_point': tipping_point,
    'sensitivity_grid_summary': {
        'grid_min_effect': float(sensitivity_grid['sensitivity_adjusted_effect'].min()),
        'grid_max_effect': float(sensitivity_grid['sensitivity_adjusted_effect'].max()),
        'share_grid_effect_below_zero': float((sensitivity_grid['sensitivity_adjusted_effect'] <= 0).mean()),
    },
    'placebo_summary': placebo_summary,
    'sensitivity_gates': dataframe_records(sensitivity_gate_table, max_rows=10),
    'sensitivity_recommendation': sensitivity_recommendation,
    'explicit_warnings': [
        'Sensitivity analysis does not prove the effect is causal.',
        'The oracle estimate uses a hidden confounder that exists only in this synthetic teaching example.',
        'A placebo outcome warning should reduce confidence even if the main estimate is statistically significant.',
        'LLM summaries should not convert sensitivity checks into a robustness certificate.',
    ],
}

print(json.dumps(sensitivity_report, indent=2)[:5200])

{
  "analysis_contract": {
    "analysis_id": "onboarding_specialist_sensitivity_v1",
    "estimand": "ATE of onboarding_specialist on net_revenue_90d among eligible accounts",
    "treatment": "onboarding_specialist",
    "outcome": "net_revenue_90d",
    "placebo_outcome": "prior_period_revenue",
    "allowed_covariates": [
      "segment",
      "region",
      "tenure_months",
      "baseline_usage",
      "prior_support_tickets",
      "contract_value",
      "observed_health",
      "risk_score"
    ],
    "categorical_covariates": [
      "segment",
      "region"
    ],
    "oracle_hidden_confounders": [
      "executive_attention"
    ],
    "forbidden_variables": [
      "product_adoption_30d",
      "expansion_pipeline_120d",
      "true_effect"
    ],
    "diagnostic_thresholds": {
      "placebo_caution_revenue_units": 4.0,
      "tipping_point_distance_caution": 0.35,
      "benchmark_shift_caution_ratio": 1.0
    }
  },
  "truth_available_only_because_synthetic": {
    "true_ate": 8.388275446972925,
    "true_att": 8.50067894836023,
    "treatment_rate": 0.685,
    "hidden_strength": 0.65
  },
  "observed_adjusted_estimate": {
    "formula": "net_revenue_90d ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_value + observed_health + risk_score",
    "estimate": 9.757900793561795,
    "robust_se": 0.729625671578333,
    "ci_low": 8.327834477268262,
    "ci_high": 11.187967109855329,
    "p_value": 8.597863559654592e-41,
    "rows": 5200
  },
  "oracle_hidden_confounder_estimate": {
    "formula": "net_revenue_90d ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_value + observed_health + risk_score + executive_attention",
    "estimate": 7.336248179580361,
    "robust_se": 0.7170837832752662,
    "ci_low": 5.930763964360839,
    "ci_high": 8.741732394799882,
    "p_value": 1.4451474851989958e-24,
    "rows": 5200
  },
  "placebo_estimate": {
    "formula": "prior_period_revenue ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_value + observed_health + risk_score",
    "estimate": 2.1943210894276772,
    "robust_se": 0.6043799867184138,
    "ci_low": 1.0097363154595862,
    "ci_high": 3.3789058633957683,
    "p_value": 0.0002826559845974888,
    "rows": 5200
  },
  "benchmark_covariate_table": [
    {
      "omitted_covariate": "risk_score",
      "estimate_without_covariate": 10.041754266,
      "shift_from_full_model": 0.2838534725,
      "abs_shift": 0.2838534725,
      "shift_as_share_of_estimate": 0.0290896043
    },
    {
      "omitted_covariate": "contract_value",
      "estimate_without_covariate": 10.013070119,
      "shift_from_full_model": 0.2551693254,
      "abs_shift": 0.2551693254,
      "shift_as_share_of_estimate": 0.0261500225
    },
    {
      "omitted_covariate": "observed_health",
      "estimate_without_covariate": 9.734774011,
      "shift_from_full_model": -0.0231267825,
      "abs_shift": 0.0231267825,
      "shift_as_share_of_estimate": 0.0023700571
    },
    {
      "omitted_covariate": "segment",
      "estimate_without_covariate": 9.7769572867,
      "shift_from_full_model": 0.0190564932,
      "abs_shift": 0.0190564932,
      "shift_as_share_of_estimate": 0.0019529296
    },
    {
      "omitted_covariate": "prior_support_tickets",
      "estimate_without_covariate": 9.7415246054,
      "shift_from_full_model": -0.0163761881,
      "abs_shift": 0.0163761881,
      "shift_as_share_of_estimate": 0.0016782491
    },
    {
      "omitted_covariate": "region",
      "estimate_without_covariate": 9.7685447453,
      "shift_from_full_model": 0.0106439518,
      "abs_shift": 0.0106439518,
      "shift_as_share_of_estimate": 0.0010908034
    },
    {
      "omitted_covariate": "tenure_months",
      "estimate_without_covariate": 9.765030247,
      "shift_from_full_model": 0.0071294534,
      "abs_shift": 0.0071294534,
      "shift_as_share_of_estimate": 0.0007306339
    },
    {
      "omitted_covariate": "baseline_usage",
      "estimate_without_covariate": 9.7522938182,
      "shift_from_full_model": -0.0056069753,
      "abs_shift": 0.0056069753,
      "shift_as_share_of_estimate": 0.0005746088
    }
  ],
  "tipping_point": {
    "exists_in_grid": true,
    "min_correlation_distance": 0.6189709201569974,
    "r_tu": 0.425,
    "r_yu": 0.45,
    "effect_at_tipping_point": -0.44501253597670876
  },
  "sensitivity_grid_summary": {
    "grid_min_effect": -14.525435716611746,
    "grid_max_effect": 45.01887569649227,
    "share_grid_effect_below_zero": 0.07913369429404415
  },
  "placebo_summary": {
    "placebo_outcome": "prior_period_revenue",
    "placebo_estimate": 2.1943210894276772,
    "placebo_robust_se": 0.6043799867184138,
    "placebo_ci_low": 1.0097363154595862,
    "placebo_ci_high": 3.3789058633957683,
    "absolute_placebo_estimate": 2.1943210894276772,
    "placebo_as_share_of_observed_effect": 0.2248763474696809
  },
  "sensitivity_gates": [
    {
      "gate": "oracle shift in synthetic lab",
      "metric": "abs observed-minus-oracle estimate",
      "

Discussion

The report bundle contains numbers, gates, and warnings. This is the right object for an AI assistant to summarize. It is far safer than asking, “Is my result robust?”

11. Shared Local LLM Runtime

We reuse the shared local LLM utilities from notebooks/_shared so model loading and cleanup behavior stays consistent with earlier notebooks.

def find_project_root(start=None):
    start = Path(start or Path.cwd()).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / 'pyproject.toml').exists() and (candidate / 'notebooks').exists():
            return candidate
    return start

PROJECT_ROOT = find_project_root()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from notebooks._shared.local_llm import (
    clean_generated_text,
    clear_loaded_model_cache,
    get_device,
    local_chat as _shared_local_chat,
)
from notebooks._shared.structured_outputs import parse_pydantic_output

DEVICE = get_device()


def local_chat(user_message, system_message=None, model_id=MODEL_ID, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE):
    return _shared_local_chat(
        user_message,
        system_message=system_message,
        model_id=model_id,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        seed=SEED,
        enabled=RUN_LIVE_LOCAL_LLM,
    )

print(f'Local LLM device: {DEVICE}')

Local LLM device: cuda

12. Structured AI Sensitivity Review

The model will summarize the sensitivity report. It must not declare the result “robust” unless the gates support that language, and even then it must preserve limitations.

class SensitivityReview(BaseModel):
    analysis_summary: str = Field(description='Brief summary of estimand and design.')
    main_estimate_interpretation: list[str] = Field(description='How to interpret the main adjusted estimate.')
    benchmark_interpretation: list[str] = Field(description='Interpretation of observed-covariate benchmark shifts.')
    placebo_interpretation: list[str] = Field(description='Interpretation of placebo outcome evidence.')
    tipping_point_interpretation: list[str] = Field(description='Interpretation of residual-confounding tipping point.')
    recommendation: Literal['cautious_reporting', 'collect_more_evidence', 'do_not_make_causal_claim'] = Field(description='Recommended communication posture.')
    recommendation_rationale: list[str] = Field(description='Reasons for the recommendation.')
    additional_sensitivity_checks: list[str] = Field(description='Additional checks to run next.')
    communication_warnings: list[str] = Field(description='Warnings for stakeholder-facing language.')
    stakeholder_summary: str = Field(description='Plain-language summary for decision makers.')
    confidence: Literal['low', 'medium', 'high'] = Field(description='Confidence in the review given the report.')

REVIEW_SCALAR_FIELDS = ['analysis_summary', 'recommendation', 'stakeholder_summary', 'confidence']
REVIEW_LIST_FIELDS = [
    'main_estimate_interpretation',
    'benchmark_interpretation',
    'placebo_interpretation',
    'tipping_point_interpretation',
    'recommendation_rationale',
    'additional_sensitivity_checks',
    'communication_warnings',
]
REVIEW_VALUE_ALIASES = {
    'recommendation': {
        'cautious_reporting': 'cautious_reporting',
        'cautiousreporting': 'cautious_reporting',
        'cautious': 'cautious_reporting',
        'caution': 'cautious_reporting',
        'report_cautiously': 'cautious_reporting',
        'proceed_with_caution': 'cautious_reporting',
        'collect_more_evidence': 'collect_more_evidence',
        'collectmoreevidence': 'collect_more_evidence',
        'more_evidence': 'collect_more_evidence',
        'gather_more_evidence': 'collect_more_evidence',
        'additional_evidence': 'collect_more_evidence',
        'do_not_make_causal_claim': 'do_not_make_causal_claim',
        'donotmakecausalclaim': 'do_not_make_causal_claim',
        'do_not_claim_causality': 'do_not_make_causal_claim',
        'no_causal_claim': 'do_not_make_causal_claim',
        'avoid_causal_claim': 'do_not_make_causal_claim',
    },
    'confidence': {
        'low': 'low',
        'medium': 'medium',
        'moderate': 'medium',
        'high': 'high',
    },
}

REVIEW_DEFAULTS = {
    'analysis_summary': '',
    'main_estimate_interpretation': [],
    'benchmark_interpretation': [],
    'placebo_interpretation': [],
    'tipping_point_interpretation': [],
    'recommendation_rationale': [],
    'additional_sensitivity_checks': [],
    'communication_warnings': [],
    'stakeholder_summary': '',
    'confidence': 'medium',
}

REVIEW_ALIASES = {
    'summary': 'analysis_summary',
    'main_estimate': 'main_estimate_interpretation',
    'benchmark': 'benchmark_interpretation',
    'placebo': 'placebo_interpretation',
    'tipping_point': 'tipping_point_interpretation',
    'rationale': 'recommendation_rationale',
    'next_checks': 'additional_sensitivity_checks',
    'warnings': 'communication_warnings',
}


def parse_sensitivity_review(raw_output):
    result = parse_pydantic_output(
        raw_output,
        SensitivityReview,
        scalar_fields=REVIEW_SCALAR_FIELDS,
        list_fields=REVIEW_LIST_FIELDS,
        field_aliases=REVIEW_ALIASES,
        value_aliases=REVIEW_VALUE_ALIASES,
        defaults=REVIEW_DEFAULTS,
    )
    return result.parsed, result.json_text, result.notes

SYSTEM_REVIEW_MESSAGE = (
    'You are a careful causal inference sensitivity-analysis reviewer.\n'
    'Rules:\n'
    '- Use only the provided sensitivity report.\n'
    '- Return valid JSON only. No markdown. No preamble.\n'
    '- Do not claim sensitivity analysis proves causality.\n'
    '- Mention benchmark confounders, placebo outcome evidence, tipping point, and oracle limitations when available.\n'
    '- Do not treat synthetic oracle columns as available in real data.\n'
    '- If placebo or tipping-point gates are cautionary, do not recommend strong causal claims.\n'
    '- Mention brittleness and rerun/model-output instability where appropriate.'
)


def build_review_prompt(report):
    schema_hint = {
        'analysis_summary': 'string',
        'main_estimate_interpretation': ['string'],
        'benchmark_interpretation': ['string'],
        'placebo_interpretation': ['string'],
        'tipping_point_interpretation': ['string'],
        'recommendation': 'cautious_reporting | collect_more_evidence | do_not_make_causal_claim',
        'recommendation_rationale': ['string'],
        'additional_sensitivity_checks': ['string'],
        'communication_warnings': ['string'],
        'stakeholder_summary': 'string',
        'confidence': 'low | medium | high',
    }
    return textwrap.dedent(
        f'''
        Produce a SensitivityReview JSON object using this schema.
        Use the exact enum strings shown in the schema. Every field is required; use [] when no useful item is available.
        Return one JSON object only.
        {json.dumps(schema_hint, indent=2)}

        Sensitivity report:
        {json.dumps(report, indent=2)}
        '''
    ).strip()

review_prompt = build_review_prompt(sensitivity_report)
print(review_prompt[:3000])

Produce a SensitivityReview JSON object using this schema.
        Use the exact enum strings shown in the schema. Every field is required; use [] when no useful item is available.
        Return one JSON object only.
        {
  "analysis_summary": "string",
  "main_estimate_interpretation": [
    "string"
  ],
  "benchmark_interpretation": [
    "string"
  ],
  "placebo_interpretation": [
    "string"
  ],
  "tipping_point_interpretation": [
    "string"
  ],
  "recommendation": "cautious_reporting | collect_more_evidence | do_not_make_causal_claim",
  "recommendation_rationale": [
    "string"
  ],
  "additional_sensitivity_checks": [
    "string"
  ],
  "communication_warnings": [
    "string"
  ],
  "stakeholder_summary": "string",
  "confidence": "low | medium | high"
}

        Sensitivity report:
        {
  "analysis_contract": {
    "analysis_id": "onboarding_specialist_sensitivity_v1",
    "estimand": "ATE of onboarding_specialist on net_revenue_90d among eligible accounts",
    "treatment": "onboarding_specialist",
    "outcome": "net_revenue_90d",
    "placebo_outcome": "prior_period_revenue",
    "allowed_covariates": [
      "segment",
      "region",
      "tenure_months",
      "baseline_usage",
      "prior_support_tickets",
      "contract_value",
      "observed_health",
      "risk_score"
    ],
    "categorical_covariates": [
      "segment",
      "region"
    ],
    "oracle_hidden_confounders": [
      "executive_attention"
    ],
    "forbidden_variables": [
      "product_adoption_30d",
      "expansion_pipeline_120d",
      "true_effect"
    ],
    "diagnostic_thresholds": {
      "placebo_caution_revenue_units": 4.0,
      "tipping_point_distance_caution": 0.35,
      "benchmark_shift_caution_ratio": 1.0
    }
  },
  "truth_available_only_because_synthetic": {
    "true_ate": 8.388275446972925,
    "true_att": 8.50067894836023,
    "treatment_rate": 0.685,
    "hidden_strength": 0.65
  },
  "observed_adjusted_estimate": {
    "formula": "net_revenue_90d ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_value + observed_health + risk_score",
    "estimate": 9.757900793561795,
    "robust_se": 0.729625671578333,
    "ci_low": 8.327834477268262,
    "ci_high": 11.187967109855329,
    "p_value": 8.597863559654592e-41,
    "rows": 5200
  },
  "oracle_hidden_confounder_estimate": {
    "formula": "net_revenue_90d ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_value + observed_health + risk_score + executive_attention",
    "estimate": 7.336248179580361,
    "robust_se": 0.7170837832752662,
    "ci_low": 5.930763964360839,
    "ci_high": 8.741732394799882,
    "p_value": 1.4451474851989958e-24,
    "rows": 5200
  },
  "placebo_estimate": {
    "formula": "prior_period_revenue ~ onboarding_specialist + C(segment) + C(region) + tenure_months + baseline_usage + prior_support_tickets + contract_

13. Run One Local Model

We start with the fast local model. A good answer should summarize the sensitivity evidence without turning it into a robustness certificate.

if RUN_LIVE_LOCAL_LLM:
    raw_review_output = local_chat(
        review_prompt,
        system_message=SYSTEM_REVIEW_MESSAGE,
        model_id=MODEL_ID,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
    )
    display(Markdown('### Raw model output'))
    display(Markdown(f'```json\n{clean_generated_text(raw_review_output)}\n```'))
else:
    raw_review_output = ''
    print('RUN_LIVE_LOCAL_LLM is False. Skipping live local LLM call in the rendered notebook.')

Raw model output

{
 "analysis_summary": "The analysis evaluates the ATE of onboarding specialists on net revenue 90 days post-treatment, adjusting for several covariates. The observed adjusted estimate suggests a positive effect, but sensitivity analyses indicate potential confounding.",
 "main_estimate_interpretation": [
 "The observed adjusted estimate indicates a positive effect of onboarding specialists on net revenue 90 days post-treatment."
 ],
 "benchmark_interpretation": [
 "Omitting 'risk_score' and 'contract_value' results in the largest shifts in the estimate, but these shifts are relatively small compared to the overall estimate."
 ],
 "placebo_interpretation": [
 "The placebo outcome shows a non-zero effect, which is a cautionary signal as it should be close to zero if there were no confounding."
 ],
 "tipping_point_interpretation": [
 "The tipping point is relatively far from the current estimate, suggesting some robustness to unmeasured confounding."
 ],
 "recommendation": "cautious_reporting",
 "recommendation_rationale": [
 "The placebo outcome shows a non-zero effect, indicating potential confounding issues.",
 "The oracle shift in synthetic lab is at the caution threshold, suggesting the estimate could change significantly with additional confounding."
 ],
 "additional_sensitivity_checks": [
 "Conduct further sensitivity analyses with different sets of covariates.",
 "Explore the impact of interaction terms between covariates."
 ],
 "communication_warnings": [
 "Avoid making strong causal claims based solely on this analysis.",
 "Emphasize the limitations of the sensitivity analysis and the need for further validation."
 ],
 "stakeholder_summary": "The analysis suggests a positive effect of onboarding specialists on net revenue, but sensitivity checks indicate potential confounding. Further investigation is recommended before making definitive causal claims.",
 "confidence": "medium"
}

if raw_review_output:
    try:
        sensitivity_review, sensitivity_review_json, review_parser_notes = parse_sensitivity_review(raw_review_output)
        single_model_parse_error = ''
        display(Markdown('### Parsed sensitivity review'))
        display(Markdown(f'```json\n{sensitivity_review.model_dump_json(indent=2)}\n```'))
        print('Parser notes:', review_parser_notes)
    except Exception as error:
        sensitivity_review = None
        sensitivity_review_json = ''
        review_parser_notes = []
        single_model_parse_error = clean_generated_text(repr(error))
        print('The single-model output could not be parsed. This is a useful brittleness signal, not a reason to trust the output silently.')
        print(single_model_parse_error[:1200])
else:
    sensitivity_review = None
    sensitivity_review_json = ''
    review_parser_notes = []
    single_model_parse_error = ''

Parsed sensitivity review

{
  "analysis_summary": "The analysis evaluates the ATE of onboarding specialists on net revenue 90 days post-treatment, adjusting for several covariates. The observed adjusted estimate suggests a positive effect, but sensitivity analyses indicate potential confounding.",
  "main_estimate_interpretation": [
    "The observed adjusted estimate indicates a positive effect of onboarding specialists on net revenue 90 days post-treatment."
  ],
  "benchmark_interpretation": [
    "Omitting 'risk_score' and 'contract_value' results in the largest shifts in the estimate, but these shifts are relatively small compared to the overall estimate."
  ],
  "placebo_interpretation": [
    "The placebo outcome shows a non-zero effect, which is a cautionary signal as it should be close to zero if there were no confounding."
  ],
  "tipping_point_interpretation": [
    "The tipping point is relatively far from the current estimate, suggesting some robustness to unmeasured confounding."
  ],
  "recommendation": "cautious_reporting",
  "recommendation_rationale": [
    "The placebo outcome shows a non-zero effect, indicating potential confounding issues.",
    "The oracle shift in synthetic lab is at the caution threshold, suggesting the estimate could change significantly with additional confounding."
  ],
  "additional_sensitivity_checks": [
    "Conduct further sensitivity analyses with different sets of covariates.",
    "Explore the impact of interaction terms between covariates."
  ],
  "communication_warnings": [
    "Avoid making strong causal claims based solely on this analysis.",
    "Emphasize the limitations of the sensitivity analysis and the need for further validation."
  ],
  "stakeholder_summary": "The analysis suggests a positive effect of onboarding specialists on net revenue, but sensitivity checks indicate potential confounding. Further investigation is recommended before making definitive causal claims.",
  "confidence": "medium"
}

Parser notes: []

14. Audit the AI Sensitivity Review

A good sensitivity review must mention the actual sensitivity evidence and avoid causal overclaiming.

def review_text(review):
    if review is None:
        return ''
    return clean_generated_text(json.dumps(review.model_dump(), sort_keys=True)).lower()


def contains_any(text, terms):
    return any(term.lower() in text for term in terms)


def expected_recommendation_from_sensitivity(gates):
    cautions = (gates['status'] == 'caution').sum()
    if cautions >= 2:
        return 'collect_more_evidence'
    if cautions == 1:
        return 'cautious_reporting'
    return 'cautious_reporting'

expected_recommendation = expected_recommendation_from_sensitivity(sensitivity_gate_table)


def recommendation_not_less_cautious(model_rec, expected_rec):
    order = {'cautious_reporting': 0, 'collect_more_evidence': 1, 'do_not_make_causal_claim': 2}
    return order.get(model_rec, -1) >= order.get(expected_rec, 0)


def score_sensitivity_review(review, expected_recommendation):
    text = review_text(review)
    recommendation = review.recommendation if review is not None else ''
    checks = {
        'states estimand': analysis_contract['treatment'] in text and analysis_contract['outcome'] in text,
        'mentions benchmark confounders': contains_any(text, ['benchmark', 'observed covariate', 'omitted covariate']),
        'mentions placebo outcome': contains_any(text, ['placebo', analysis_contract['placebo_outcome']]),
        'mentions tipping point': contains_any(text, ['tipping', 'residual confounding', 'sign reversal']),
        'mentions oracle limitation': contains_any(text, ['oracle', 'synthetic', 'hidden confounder', 'not available in real data']),
        'recommendation is not less cautious than gates': recommendation_not_less_cautious(recommendation, expected_recommendation),
        'does not claim sensitivity proves causality': not contains_any(text, ['proves causality', 'proves the effect', 'certifies robustness', 'eliminates unmeasured confounding']),
        'includes communication warnings': len(review.communication_warnings) >= 2 if review is not None else False,
        'includes additional checks': len(review.additional_sensitivity_checks) >= 2 if review is not None else False,
        'mentions brittleness or caution': contains_any(text, ['caution', 'brittle', 'sensitivity', 'fragile', 'uncertain']),
    }
    audit = pd.DataFrame([{'check': key, 'passed': bool(value)} for key, value in checks.items()])
    audit['credit'] = audit['passed'].astype(int)
    return audit, int(audit['credit'].sum()), len(audit)

if sensitivity_review is not None:
    single_audit, single_score, single_max_score = score_sensitivity_review(sensitivity_review, expected_recommendation)
    print(f'Expected recommendation from gates: {expected_recommendation}')
    print(f'Score: {single_score}/{single_max_score}')
    display(single_audit)
else:
    print('No sensitivity review to audit because live model execution was skipped or parsing failed.')

Expected recommendation from gates: cautious_reporting
Score: 9/10

	check	passed	credit
0	states estimand	False	0
1	mentions benchmark confounders	True	1
2	mentions placebo outcome	True	1
3	mentions tipping point	True	1
4	mentions oracle limitation	True	1
5	recommendation is not less cautious than gates	True	1
6	does not claim sensitivity proves causality	True	1
7	includes communication warnings	True	1
8	includes additional checks	True	1
9	mentions brittleness or caution	True	1

Discussion

The model may be more conservative than the deterministic gates. That is acceptable. It should not be less conservative, and it should never say the sensitivity analysis proves the causal claim.

15. Optional All-Model Comparison

We now compare local models on two sensitivity reports: a moderate-hidden-confounding case and a strong-hidden-confounding case.

The full single-model review above uses a richer narrative schema. For the all-model comparison, we intentionally use a compact decision schema: key risk signals, recommendation, rationale, next checks, warnings, and confidence. This makes the comparison about causal judgment and schema reliability, not about whether every model can produce a long report with many fields.

The exact ranking can change across reruns and environments. That is expected. The durable lesson is that sensitivity-review assistants need to be evaluated against the numeric gates, not their confidence of tone. Remaining schema failures are useful evidence about which models are not reliable enough for automated causal-review workflows without additional guardrails.

def build_sensitivity_report_for_scenario(case_id, hidden_strength, overlap_noise, seed):
    scenario_df, scenario_truth = simulate_onboarding_data(seed=seed, hidden_strength=hidden_strength, overlap_noise=overlap_noise)
    obs = fit_effect_model(
        scenario_df,
        analysis_contract['outcome'],
        analysis_contract['treatment'],
        analysis_contract['allowed_covariates'],
        analysis_contract['categorical_covariates'],
    )
    oracle = fit_effect_model(
        scenario_df,
        analysis_contract['outcome'],
        analysis_contract['treatment'],
        analysis_contract['allowed_covariates'] + analysis_contract['oracle_hidden_confounders'],
        analysis_contract['categorical_covariates'],
    )
    placebo = fit_effect_model(
        scenario_df,
        analysis_contract['placebo_outcome'],
        analysis_contract['treatment'],
        analysis_contract['allowed_covariates'],
        analysis_contract['categorical_covariates'],
    )
    bench_rows = []
    for covariate in analysis_contract['allowed_covariates']:
        covariates = [var for var in analysis_contract['allowed_covariates'] if var != covariate]
        categorical = [var for var in analysis_contract['categorical_covariates'] if var in covariates]
        reduced = fit_effect_model(scenario_df, analysis_contract['outcome'], analysis_contract['treatment'], covariates, categorical)
        shift = reduced['estimate'] - obs['estimate']
        bench_rows.append({'omitted_covariate': covariate, 'estimate_without_covariate': reduced['estimate'], 'shift_from_full_model': shift, 'abs_shift': abs(shift), 'shift_as_share_of_estimate': abs(shift) / abs(obs['estimate'])})
    bench = pd.DataFrame(bench_rows).sort_values('abs_shift', ascending=False)

    X = design_matrix_for_contract(scenario_df, analysis_contract['allowed_covariates'], analysis_contract['categorical_covariates'])
    rt = residualize(scenario_df[analysis_contract['treatment']], X)
    ry = residualize(scenario_df[analysis_contract['outcome']], X)
    rows = []
    for r_tu in r_values:
        for r_yu in r_values:
            rows.append({'r_tu': r_tu, 'r_yu': r_yu, 'sensitivity_adjusted_effect': coefficient_with_unobserved_confounder(rt, ry, r_tu, r_yu)})
    grid = pd.DataFrame(rows)
    zero = grid.query('sensitivity_adjusted_effect <= 0').copy()
    if zero.empty:
        tip = {'exists_in_grid': False, 'min_correlation_distance': np.nan, 'r_tu': np.nan, 'r_yu': np.nan, 'effect_at_tipping_point': np.nan}
    else:
        zero['correlation_distance'] = np.sqrt(zero['r_tu'] ** 2 + zero['r_yu'] ** 2)
        row = zero.sort_values('correlation_distance').iloc[0]
        tip = {'exists_in_grid': True, 'min_correlation_distance': float(row['correlation_distance']), 'r_tu': float(row['r_tu']), 'r_yu': float(row['r_yu']), 'effect_at_tipping_point': float(row['sensitivity_adjusted_effect'])}
    placebo_sum = {
        'placebo_outcome': analysis_contract['placebo_outcome'],
        'placebo_estimate': placebo['estimate'],
        'placebo_robust_se': placebo['robust_se'],
        'placebo_ci_low': placebo['ci_low'],
        'placebo_ci_high': placebo['ci_high'],
        'absolute_placebo_estimate': abs(placebo['estimate']),
        'placebo_as_share_of_observed_effect': abs(placebo['estimate']) / abs(obs['estimate']),
    }
    gates, rec = classify_sensitivity_gates(obs, oracle, bench, placebo_sum, tip, analysis_contract)
    report = {
        'case_id': case_id,
        'analysis_contract': analysis_contract,
        'truth_available_only_because_synthetic': scenario_truth,
        'observed_adjusted_estimate': {k: v for k, v in obs.items() if k != 'model'},
        'oracle_hidden_confounder_estimate': {k: v for k, v in oracle.items() if k != 'model'},
        'placebo_estimate': {k: v for k, v in placebo.items() if k != 'model'},
        'benchmark_covariate_table': dataframe_records(bench, max_rows=12),
        'tipping_point': tip,
        'sensitivity_grid_summary': {
            'grid_min_effect': float(grid['sensitivity_adjusted_effect'].min()),
            'grid_max_effect': float(grid['sensitivity_adjusted_effect'].max()),
            'share_grid_effect_below_zero': float((grid['sensitivity_adjusted_effect'] <= 0).mean()),
        },
        'placebo_summary': placebo_sum,
        'sensitivity_gates': dataframe_records(gates, max_rows=10),
        'sensitivity_recommendation': rec,
        'explicit_warnings': sensitivity_report['explicit_warnings'],
    }
    return report, expected_recommendation_from_sensitivity(gates)

moderate_report, moderate_expected = build_sensitivity_report_for_scenario('moderate_hidden_confounding', 0.35, 0.90, SEED + 1)
strong_report, strong_expected = build_sensitivity_report_for_scenario('strong_hidden_confounding', 1.05, 0.55, SEED + 2)

SENSITIVITY_EVAL_CASES = [
    {'case_id': 'moderate_hidden_confounding', 'report': moderate_report, 'expected_recommendation': moderate_expected},
    {'case_id': 'strong_hidden_confounding', 'report': strong_report, 'expected_recommendation': strong_expected},
]
[(case['case_id'], case['expected_recommendation']) for case in SENSITIVITY_EVAL_CASES]

[('moderate_hidden_confounding', 'cautious_reporting'),
 ('strong_hidden_confounding', 'collect_more_evidence')]

SUMMARY_COLUMNS = [
    'label', 'model_id', 'role', 'cases', 'schema_valid_cases', 'schema_repaired_cases',
    'schema_reliability', 'mean_review_score', 'failure_types'
]
CASE_RESULT_COLUMNS = [
    'label', 'model_id', 'role', 'case_id', 'expected_recommendation', 'model_recommendation', 'status',
    'schema_valid', 'repair_used', 'repair_stage', 'error_type', 'review_score', 'max_review_score',
    'review_score_share', 'error', 'raw_output_preview'
]

class SensitivityDecision(BaseModel):
    analysis_summary: str = Field(description='One-sentence summary of the sensitivity-analysis situation.')
    key_risk_signals: list[str] = Field(description='Most important sensitivity, placebo, benchmark, or tipping-point signals.')
    recommendation: Literal['cautious_reporting', 'collect_more_evidence', 'do_not_make_causal_claim'] = Field(description='Recommended communication posture.')
    recommendation_rationale: list[str] = Field(description='Reasons for the recommendation.')
    additional_sensitivity_checks: list[str] = Field(description='Additional checks to run next.')
    communication_warnings: list[str] = Field(description='Warnings for stakeholder-facing language.')
    confidence: Literal['low', 'medium', 'high'] = Field(description='Confidence in the decision given the report.')

DECISION_SCALAR_FIELDS = ['analysis_summary', 'recommendation', 'confidence']
DECISION_LIST_FIELDS = [
    'key_risk_signals',
    'recommendation_rationale',
    'additional_sensitivity_checks',
    'communication_warnings',
]
DECISION_ALIASES = {
    'summary': 'analysis_summary',
    'risk_signals': 'key_risk_signals',
    'risks': 'key_risk_signals',
    'signals': 'key_risk_signals',
    'rationale': 'recommendation_rationale',
    'reasons': 'recommendation_rationale',
    'next_checks': 'additional_sensitivity_checks',
    'checks': 'additional_sensitivity_checks',
    'warnings': 'communication_warnings',
}
DECISION_DEFAULTS = {
    'analysis_summary': '',
    'key_risk_signals': [],
    'recommendation_rationale': [],
    'additional_sensitivity_checks': [],
    'communication_warnings': [],
    'confidence': 'medium',
}

DECISION_REPAIR_PROMPT_TEMPLATE = textwrap.dedent(
    '''
    Your previous answer could not be parsed as the required SensitivityDecision JSON schema.
    Convert the previous answer into valid JSON only. Do not add new causal claims.
    Use exact enum strings from the schema. Keep missing non-critical list fields as empty arrays.

    Required schema:
    {
      "analysis_summary": "string",
      "key_risk_signals": ["string"],
      "recommendation": "cautious_reporting | collect_more_evidence | do_not_make_causal_claim",
      "recommendation_rationale": ["string"],
      "additional_sensitivity_checks": ["string"],
      "communication_warnings": ["string"],
      "confidence": "low | medium | high"
    }

    Parser error:
    {error_message}

    Previous answer:
    {raw_output}
    '''
).strip()


def compact_report_for_model_comparison(report):
    contract = report['analysis_contract']
    return {
        'estimand': contract['estimand'],
        'treatment': contract['treatment'],
        'outcome': contract['outcome'],
        'placebo_outcome': contract['placebo_outcome'],
        'main_estimate': report['observed_adjusted_estimate'],
        'oracle_estimate_synthetic_only': report['oracle_hidden_confounder_estimate'],
        'placebo_summary': report['placebo_summary'],
        'tipping_point': report['tipping_point'],
        'sensitivity_grid_summary': report['sensitivity_grid_summary'],
        'top_benchmark_shifts': report['benchmark_covariate_table'][:3],
        'sensitivity_gates': report['sensitivity_gates'],
        'recommended_posture_from_rules': report.get('sensitivity_recommendation', report.get('recommendation', '')),
        'brittleness_note': 'Local LLM outputs may vary across reruns; assess schema validity separately from causal reasoning quality.',
    }


def build_decision_prompt(report):
    schema_hint = {
        'analysis_summary': 'string',
        'key_risk_signals': ['string'],
        'recommendation': 'cautious_reporting | collect_more_evidence | do_not_make_causal_claim',
        'recommendation_rationale': ['string'],
        'additional_sensitivity_checks': ['string'],
        'communication_warnings': ['string'],
        'confidence': 'low | medium | high',
    }
    return textwrap.dedent(
        f'''
        Produce one compact SensitivityDecision JSON object using this schema.
        Use the exact enum strings shown in the schema. Every field is required; use [] when no useful item is available.
        Return one JSON object only. Keep each list to 2-4 concise strings.
        {json.dumps(schema_hint, indent=2)}

        Compact sensitivity report:
        {json.dumps(compact_report_for_model_comparison(report), indent=2)}
        '''
    ).strip()


def parse_sensitivity_decision(raw_output):
    result = parse_pydantic_output(
        raw_output,
        SensitivityDecision,
        scalar_fields=DECISION_SCALAR_FIELDS,
        list_fields=DECISION_LIST_FIELDS,
        field_aliases=DECISION_ALIASES,
        value_aliases=REVIEW_VALUE_ALIASES,
        defaults=DECISION_DEFAULTS,
    )
    return result.parsed, result.json_text, result.notes


def classify_structured_output_failure(error):
    text = clean_generated_text(repr(error)).lower()
    if 'empty model output' in text:
        return 'empty_output'
    if 'field required' in text or 'missing' in text:
        return 'missing_required_field'
    if 'input should be' in text or 'validation error' in text:
        return 'wrong_field_type_or_schema'
    if 'invalid json' in text or 'expecting value' in text or 'eof' in text or 'jsondecodeerror' in text:
        return 'invalid_json_or_truncated_output'
    return 'other_structured_output_error'


def empty_sensitivity_decision():
    return SensitivityDecision(
        analysis_summary='',
        key_risk_signals=[],
        recommendation='cautious_reporting',
        recommendation_rationale=[],
        additional_sensitivity_checks=[],
        communication_warnings=[],
        confidence='low',
    )


def parse_or_repair_decision(raw_output, model_id):
    if not clean_generated_text(raw_output):
        raise ValueError('empty model output')
    try:
        parsed, parsed_json, notes = parse_sensitivity_decision(raw_output)
        return {
            'parsed': parsed,
            'parsed_json': parsed_json,
            'parser_notes': notes,
            'repair_used': bool(notes),
            'repair_stage': 'parser' if notes else 'none',
            'repaired_raw_output': '',
        }
    except Exception as first_error:
        if not (RUN_SCHEMA_REPAIR_RETRY and RUN_LIVE_LOCAL_LLM):
            raise
        repair_prompt = DECISION_REPAIR_PROMPT_TEMPLATE.format(
            raw_output=clean_generated_text(raw_output)[:5000],
            error_message=clean_generated_text(repr(first_error))[:1200],
        )
        repaired_raw_output = local_chat(
            repair_prompt,
            system_message=SYSTEM_REVIEW_MESSAGE,
            model_id=model_id,
            max_new_tokens=900,
            temperature=TEMPERATURE,
        )
        parsed, parsed_json, notes = parse_sensitivity_decision(repaired_raw_output)
        return {
            'parsed': parsed,
            'parsed_json': parsed_json,
            'parser_notes': [f'first_parse_error: {classify_structured_output_failure(first_error)}'] + notes,
            'repair_used': True,
            'repair_stage': 'model_retry',
            'repaired_raw_output': repaired_raw_output,
        }


def run_single_model_sensitivity_case(label, model_id, role, case):
    prompt = build_decision_prompt(case['report'])
    raw_output = ''
    max_score = score_sensitivity_review(empty_sensitivity_decision(), case['expected_recommendation'])[2]
    try:
        raw_output = local_chat(
            prompt,
            system_message=SYSTEM_REVIEW_MESSAGE,
            model_id=model_id,
            max_new_tokens=900,
            temperature=TEMPERATURE,
        )
        parsed_result = parse_or_repair_decision(raw_output, model_id)
        audit, score, max_score = score_sensitivity_review(parsed_result['parsed'], case['expected_recommendation'])
        return {
            'label': label,
            'model_id': model_id,
            'role': role,
            'case_id': case['case_id'],
            'expected_recommendation': case['expected_recommendation'],
            'model_recommendation': parsed_result['parsed'].recommendation,
            'status': 'ok',
            'schema_valid': True,
            'repair_used': parsed_result['repair_used'],
            'repair_stage': parsed_result['repair_stage'],
            'error_type': '',
            'review_score': score,
            'max_review_score': max_score,
            'review_score_share': score / max_score if max_score else 0.0,
            'error': '',
            'raw_output_preview': clean_generated_text(raw_output)[:500],
        }
    except Exception as error:
        return {
            'label': label,
            'model_id': model_id,
            'role': role,
            'case_id': case['case_id'],
            'expected_recommendation': case['expected_recommendation'],
            'model_recommendation': '',
            'status': 'failed',
            'schema_valid': False,
            'repair_used': False,
            'repair_stage': 'failed',
            'error_type': classify_structured_output_failure(error),
            'review_score': 0,
            'max_review_score': max_score,
            'review_score_share': 0.0,
            'error': clean_generated_text(repr(error))[:900],
            'raw_output_preview': clean_generated_text(raw_output)[:500],
        }


def summarize_model_results(case_results):
    if case_results.empty:
        return pd.DataFrame(columns=SUMMARY_COLUMNS)
    summary = (
        case_results
        .groupby(['label', 'model_id', 'role'], as_index=False)
        .agg(
            cases=('case_id', 'count'),
            schema_valid_cases=('schema_valid', 'sum'),
            schema_repaired_cases=('repair_used', 'sum'),
            mean_review_score=('review_score_share', 'mean'),
            failure_types=('error_type', lambda values: sorted({value for value in values if value})),
        )
    )
    summary['schema_reliability'] = summary['schema_valid_cases'] / summary['cases']
    return summary[SUMMARY_COLUMNS].sort_values(['mean_review_score', 'schema_reliability'], ascending=False)


def run_all_model_sensitivity_comparison(models_to_compare=MODELS_TO_COMPARE, cases=SENSITIVITY_EVAL_CASES):
    rows = []
    for label, model_id, role in models_to_compare:
        print(f'Running {label}: {model_id}')
        for case in cases[:MODEL_COMPARISON_CASE_LIMIT]:
            rows.append(run_single_model_sensitivity_case(label, model_id, role, case))
        clear_loaded_model_cache()
    case_results = pd.DataFrame(rows, columns=CASE_RESULT_COLUMNS)
    summary = summarize_model_results(case_results)
    return summary, case_results

if RUN_FULL_MODEL_COMPARISON and RUN_LIVE_LOCAL_LLM:
    sensitivity_model_summary, sensitivity_case_results = run_all_model_sensitivity_comparison()
else:
    sensitivity_model_summary = pd.DataFrame(columns=SUMMARY_COLUMNS)
    sensitivity_case_results = pd.DataFrame(columns=CASE_RESULT_COLUMNS)
    print('Full model comparison skipped. Set RUN_FULL_MODEL_COMPARISON and RUN_LIVE_LOCAL_LLM to True to run it.')

sensitivity_model_summary

Running Qwen 0.5B: Qwen/Qwen2.5-0.5B-Instruct
Running Qwen 7B: Qwen/Qwen2.5-7B-Instruct
Running Qwen 14B: Qwen/Qwen2.5-14B-Instruct
Running Qwen 32B: Qwen/Qwen2.5-32B-Instruct
Running Phi mini: microsoft/Phi-3.5-mini-instruct

Running Mistral 7B: mistralai/Mistral-7B-Instruct-v0.3
Running Mistral Small 24B: mistralai/Mistral-Small-3.1-24B-Instruct-2503

Running Gemma 3 27B: google/gemma-3-27b-it
Running Llama 3.1 8B: meta-llama/Meta-Llama-3.1-8B-Instruct

	label	model_id	role	cases	schema_valid_cases	schema_repaired_cases	schema_reliability	mean_review_score	failure_types
1	Llama 3.1 8B	meta-llama/Meta-Llama-3.1-8B-Instruct	industry-standard instruct baseline	2	2	0	1.0	0.95	[]
0	Gemma 3 27B	google/gemma-3-27b-it	large non-Qwen comparison	2	2	2	1.0	0.90	[]
4	Phi mini	microsoft/Phi-3.5-mini-instruct	compact non-Qwen comparison	2	2	2	1.0	0.90	[]
3	Mistral Small 24B	mistralai/Mistral-Small-3.1-24B-Instruct-2503	strong non-Qwen comparison	2	2	2	1.0	0.85	[]
7	Qwen 32B	Qwen/Qwen2.5-32B-Instruct	scale comparison	2	2	0	1.0	0.85	[]
6	Qwen 14B	Qwen/Qwen2.5-14B-Instruct	strong local analysis	2	2	0	1.0	0.80	[]
8	Qwen 7B	Qwen/Qwen2.5-7B-Instruct	fast default	2	2	0	1.0	0.80	[]
2	Mistral 7B	mistralai/Mistral-7B-Instruct-v0.3	7B model-family comparison	2	2	1	1.0	0.70	[]
5	Qwen 0.5B	Qwen/Qwen2.5-0.5B-Instruct	pipeline smoke test	2	2	2	1.0	0.40	[]

Inspecting Case-Level Sensitivity Recommendations

This table shows whether the model recommendation was at least as cautious as the sensitivity gates.

sensitivity_case_results[[
    'label',
    'case_id',
    'expected_recommendation',
    'model_recommendation',
    'review_score_share',
    'status',
    'error_type',
]].sort_values(['label', 'case_id']).head(30)

	label	case_id	expected_recommendation	model_recommendation	review_score_share	status
14	Gemma 3 27B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.9	ok
15	Gemma 3 27B	strong_hidden_confounding	collect_more_evidence	cautious_reporting	0.9	ok
16	Llama 3.1 8B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	1.0	ok
17	Llama 3.1 8B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.9	ok
10	Mistral 7B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.6	ok
11	Mistral 7B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.8	ok
12	Mistral Small 24B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.9	ok
13	Mistral Small 24B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.8	ok
8	Phi mini	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.9	ok
9	Phi mini	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.9	ok
0	Qwen 0.5B	moderate_hidden_confounding	cautious_reporting	do_not_make_causal_claim	0.3	ok
1	Qwen 0.5B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.5	ok
4	Qwen 14B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.8	ok
5	Qwen 14B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.8	ok
6	Qwen 32B	moderate_hidden_confounding	cautious_reporting	cautious_reporting	0.9	ok
7	Qwen 32B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.8	ok
2	Qwen 7B	moderate_hidden_confounding	cautious_reporting	collect_more_evidence	0.8	ok
3	Qwen 7B	strong_hidden_confounding	collect_more_evidence	collect_more_evidence	0.8	ok

Inspecting Failed Model Runs

A failed model run receives zero workflow credit instead of producing a NaN score. In sensitivity analysis, invalid or overconfident summaries can hide the most important design warnings.

failed_model_details = sensitivity_case_results.loc[
    ~sensitivity_case_results['schema_valid'],
    ['label', 'case_id', 'status', 'error_type', 'error', 'raw_output_preview'],
].reset_index(drop=True)
failed_model_details

	label	case_id	status	error_type	error	raw_output_preview

Interpreting Repair Counts

Repairs are not automatically failures. They show how much cleanup was needed before the output became usable. A sensitivity-summary assistant that often needs repair should remain firmly under analyst review.

def summarize_repair_stages(case_results):
    if case_results.empty or 'repair_stage' not in case_results.columns:
        return pd.DataFrame(columns=['repair_stage', 'cases'])
    return (
        case_results.assign(repair_stage=case_results['repair_stage'].fillna('none'))
        .groupby('repair_stage', as_index=False)
        .agg(cases=('case_id', 'count'))
        .sort_values('cases', ascending=False)
    )

summarize_repair_stages(sensitivity_case_results)

	repair_stage	cases
0	none	9
1	parser	9

16. Reusable Sensitivity Analysis Checklist

Before reporting an observational causal estimate, the analyst should be able to answer these questions.

sensitivity_checklist = pd.DataFrame(
    [
        {'check': 'Estimand and adjustment set are explicit', 'example': analysis_contract['estimand']},
        {'check': 'Main estimate is shown with uncertainty', 'example': 'observed adjusted coefficient and robust CI'},
        {'check': 'Observed covariates are used as benchmarks', 'example': 'coefficient movement when each covariate is omitted'},
        {'check': 'Placebo or negative-control evidence is reviewed', 'example': analysis_contract['placebo_outcome']},
        {'check': 'Tipping point is reported', 'example': 'minimum residual-correlation distance to sign reversal'},
        {'check': 'Synthetic oracle variables are clearly labeled when present', 'example': 'executive_attention is teaching-only'},
        {'check': 'Communication posture matches sensitivity gates', 'example': sensitivity_recommendation},
        {'check': 'AI summary is audited', 'example': 'score_sensitivity_review'},
    ]
)
sensitivity_checklist

	check	example
0	Estimand and adjustment set are explicit	ATE of onboarding_specialist on net_revenue_90d among eligible accounts
1	Main estimate is shown with uncertainty	observed adjusted coefficient and robust CI
2	Observed covariates are used as benchmarks	coefficient movement when each covariate is omitted
3	Placebo or negative-control evidence is reviewed	prior_period_revenue
4	Tipping point is reported	minimum residual-correlation distance to sign reversal
5	Synthetic oracle variables are clearly labeled when present	executive_attention is teaching-only
6	Communication posture matches sensitivity gates	report_with_caution_and_sensitivity_limitations
7	AI summary is audited	score_sensitivity_review

17. Exercises

Increase hidden_strength in the simulator and rerun the sensitivity report. How do the placebo and oracle gaps change?
Change the residual-confounding grid from correlations up to 0.60 to correlations up to 0.30. Does a sign reversal still appear?
Add a second placebo outcome that is less related to executive_attention. How should interpretation differ?
Modify the gate logic so a large placebo effect automatically triggers collect_more_evidence.
Ask one local model to summarize the report without explicit warnings. Does it overclaim robustness?
Replace the continuous outcome with a binary renewal outcome and discuss which sensitivity methods would need to change.

Key Takeaways

Sensitivity analysis is how we keep causal humility operational.

A strong sensitivity workflow has this shape:

State the estimand and adjustment set.
Estimate the main effect with uncertainty.
Benchmark against observed covariates.
Run placebo or negative-control checks when available.
Use a residual-confounding grid or other method-specific sensitivity analysis.
Translate the evidence into a communication posture.
Use AI to summarize the evidence, then audit whether it stayed cautious.

The practical rule is simple: do not ask whether the result is robust; ask how fragile it is under specific, named violations of the assumptions.