DoWhy Tutorial 14: End-To-End Observational Case Study
This notebook is a compact capstone for the DoWhy tutorial series. Instead of focusing on one API feature, it walks through a full observational causal workflow from a raw-looking table to a report-ready causal summary.
The case study asks:
What is the total effect of proactive guidance on future engagement?
The data are synthetic so we can compare estimates to the known truth, but the workflow is structured the way a real applied analysis should be:
define the business or policy question before modeling;
document variables, timing, and graph assumptions;
inspect confounding, overlap, and missingness;
identify the estimand with DoWhy;
compare estimators;
run refutation and negative-control checks;
avoid bad controls;
summarize the result with limitations and recommended next steps.
Learning Goals
By the end of this notebook, you should be able to:
Turn a realistic observational table into a causal analysis-ready dataset.
Write a causal question as treatment, outcome, population, estimand, and time horizon.
Build a graph that separates pre-treatment covariates, treatment, post-treatment mediators, and outcomes.
Use DoWhy to identify and estimate an average total effect.
Compare regression and propensity-based estimators.
Diagnose propensity overlap, covariate balance, and bad-control mistakes.
Run DoWhy refuters and negative-control checks.
Produce a concise final causal report that does not overstate the evidence.
Case Study Brief
Imagine an organization has launched proactive guidance: a targeted intervention that offers extra help, education, or nudges to users who may benefit from it. The observed data are not randomized. Users with higher need or friction are more likely to receive guidance, and those same factors also affect later engagement.
The target estimand is the average total effect of proactive guidance on future engagement. Total effect means we allow downstream activation depth to change naturally after guidance. We do not adjust for activation depth when estimating the total effect, because activation depth is a post-treatment mediator.
Setup
This cell imports the libraries, creates output folders, and suppresses non-actionable warnings that would otherwise make the executed tutorial noisy. Code remains visible throughout the notebook.
from pathlib import Path
import os
import warnings

# Keep Matplotlib cache files in a writable location during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib-ranking-sys")

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*variables are assumed unobserved.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="pydot.dot_parser")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import statsmodels.api as sm
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

from dowhy import CausalModel
import dowhy

RANDOM_SEED = 2026
rng = np.random.default_rng(RANDOM_SEED)

OUTPUT_DIR = Path("outputs")
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

print(f"DoWhy version: {getattr(dowhy, '__version__', 'unknown')}")
print(f"Figures will be saved to: {FIGURE_DIR.resolve()}")
print(f"Tables will be saved to: {TABLE_DIR.resolve()}")
DoWhy version: 0.14
Figures will be saved to: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy/outputs/figures
Tables will be saved to: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy/outputs/tables
The notebook is ready once the DoWhy version and output folders print. Every saved artifact uses a 14_ prefix so the capstone outputs are easy to find.
End-To-End Workflow Map
A complete causal analysis is not just an estimator call. The table below gives the workflow followed in this notebook and the purpose of each stage.
workflow_map = pd.DataFrame([
    {"stage": "Question framing",
     "main output": "causal question table",
     "why it matters": "Prevents estimator shopping by naming treatment, outcome, population, and estimand first."},
    {"stage": "Data understanding",
     "main output": "field guide and quality checks",
     "why it matters": "Clarifies variable timing and catches obvious data issues."},
    {"stage": "Confounding diagnostics",
     "main output": "balance and propensity overlap checks",
     "why it matters": "Shows why adjustment is needed and whether comparable treated/control units exist."},
    {"stage": "Graph and identification",
     "main output": "DAG and DoWhy estimand",
     "why it matters": "Turns assumptions into an explicit adjustment strategy."},
    {"stage": "Estimation",
     "main output": "multi-estimator comparison",
     "why it matters": "Checks whether conclusions depend heavily on one estimator."},
    {"stage": "Stress testing",
     "main output": "refuters, negative controls, sensitivity table",
     "why it matters": "Tests whether the estimate behaves sensibly under deliberate perturbations."},
    {"stage": "Reporting",
     "main output": "decision summary and limitations",
     "why it matters": "Makes the result useful without pretending observational evidence is a randomized experiment."},
])
workflow_map.to_csv(TABLE_DIR / "14_workflow_map.csv", index=False)
display(workflow_map)
stage | main output | why it matters
Question framing | causal question table | Prevents estimator shopping by naming treatment, outcome, population, and estimand first.
Data understanding | field guide and quality checks | Clarifies variable timing and catches obvious data issues.
Confounding diagnostics | balance and propensity overlap checks | Shows why adjustment is needed and whether comparable treated/control units exist.
Graph and identification | DAG and DoWhy estimand | Turns assumptions into an explicit adjustment strategy.
Estimation | multi-estimator comparison | Checks whether conclusions depend heavily on one estimator.
Stress testing | refuters, negative controls, sensitivity table | Tests whether the estimate behaves sensibly under deliberate perturbations.
Reporting | decision summary and limitations | Makes the result useful without pretending observational evidence is a randomized experiment.
The workflow moves from assumptions to diagnostics to estimates to reporting. That ordering is part of the discipline of causal work.
Causal Question Table
The next table translates the case-study brief into a concrete estimand. This is the kind of table that should appear near the top of any applied causal notebook.
causal_question = pd.DataFrame([
    {"component": "Population",
     "definition": "Users or units eligible for proactive guidance in the observational table."},
    {"component": "Treatment",
     "definition": "`proactive_guidance`: whether the unit received guidance during the exposure window."},
    {"component": "Outcome",
     "definition": "`future_engagement`: later engagement measured after the exposure and activation windows."},
    {"component": "Estimand",
     "definition": "Average total effect of guidance versus no guidance."},
    {"component": "Adjustment principle",
     "definition": "Adjust for pre-treatment confounders, not post-treatment activation depth."},
    {"component": "Main limitation",
     "definition": "Credibility depends on no unmeasured confounding after observed pre-treatment covariates."},
])
causal_question.to_csv(TABLE_DIR / "14_causal_question.csv", index=False)
display(causal_question)
component | definition
Population | Users or units eligible for proactive guidance in the observational table.
Treatment | `proactive_guidance`: whether the unit received guidance during the exposure window.
Outcome | `future_engagement`: later engagement measured after the exposure and activation windows.
Estimand | Average total effect of guidance versus no guidance.
Adjustment principle | Adjust for pre-treatment confounders, not post-treatment activation depth.
Main limitation | Credibility depends on no unmeasured confounding after observed pre-treatment covariates.
The estimand is intentionally the total effect. If we adjusted for activation depth, we would block part of the effect that guidance may create.
Data Field Guide
This case study uses a realistic observational table with pre-treatment covariates, one treatment, one post-treatment mediator, the outcome, and hidden truth columns used only for teaching checks.
field_guide = pd.DataFrame([
    {"column": "baseline_need",
     "role": "pre-treatment confounder",
     "plain meaning": "Need or intent measured before guidance assignment.",
     "use in analysis": "Adjust for it."},
    {"column": "prior_engagement",
     "role": "pre-treatment confounder",
     "plain meaning": "Historical engagement before guidance assignment.",
     "use in analysis": "Adjust for it."},
    {"column": "account_tenure_z",
     "role": "pre-treatment confounder",
     "plain meaning": "Standardized account maturity.",
     "use in analysis": "Adjust for it."},
    {"column": "friction_score",
     "role": "pre-treatment confounder",
     "plain meaning": "Pre-existing friction or difficulty before guidance.",
     "use in analysis": "Adjust for it."},
    {"column": "region_risk",
     "role": "pre-treatment confounder",
     "plain meaning": "Binary context indicator for harder operating conditions.",
     "use in analysis": "Adjust for it."},
    {"column": "high_need_segment",
     "role": "pre-treatment effect modifier and confounder",
     "plain meaning": "Whether baseline need is above zero.",
     "use in analysis": "Adjust for it and inspect heterogeneity."},
    {"column": "proactive_guidance",
     "role": "treatment",
     "plain meaning": "Whether guidance was delivered.",
     "use in analysis": "Main causal lever."},
    {"column": "activation_depth",
     "role": "post-treatment mediator",
     "plain meaning": "How deeply the user activated after guidance.",
     "use in analysis": "Do not adjust for it when estimating the total effect."},
    {"column": "future_engagement",
     "role": "outcome",
     "plain meaning": "Later engagement measured after treatment and mediator.",
     "use in analysis": "Main target outcome."},
    {"column": "pre_period_satisfaction",
     "role": "negative-control outcome",
     "plain meaning": "A pre-treatment outcome-like measure guidance cannot causally affect.",
     "use in analysis": "Use as a falsification check."},
])
field_guide.to_csv(TABLE_DIR / "14_field_guide.csv", index=False)
display(field_guide)
column | role | plain meaning | use in analysis
baseline_need | pre-treatment confounder | Need or intent measured before guidance assignment. | Adjust for it.
prior_engagement | pre-treatment confounder | Historical engagement before guidance assignment. | Adjust for it.
account_tenure_z | pre-treatment confounder | Standardized account maturity. | Adjust for it.
friction_score | pre-treatment confounder | Pre-existing friction or difficulty before guidance. | Adjust for it.
region_risk | pre-treatment confounder | Binary context indicator for harder operating conditions. | Adjust for it.
high_need_segment | pre-treatment effect modifier and confounder | Whether baseline need is above zero. | Adjust for it and inspect heterogeneity.
proactive_guidance | treatment | Whether guidance was delivered. | Main causal lever.
activation_depth | post-treatment mediator | How deeply the user activated after guidance. | Do not adjust for it when estimating the total effect.
future_engagement | outcome | Later engagement measured after treatment and mediator. | Main target outcome.
pre_period_satisfaction | negative-control outcome | A pre-treatment outcome-like measure guidance cannot causally affect. | Use as a falsification check.
The field guide separates adjustment variables from post-treatment variables. That distinction is the difference between estimating the intended total effect and accidentally estimating a direct-effect-like coefficient.
Simulate The Observational Table
The next cell creates the teaching data. Treatment assignment depends on the same pre-treatment variables that also affect the outcome, so a naive treated-versus-control comparison will be confounded.
The simulation also stores the true propensity and true unit-level total effect. Those columns are used only for teaching diagnostics and are excluded from the DoWhy model.
The analyst-facing dataset looks like a normal observational table. The hidden truth columns are saved separately so we can judge estimator behavior in this tutorial.
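The notebook's actual simulation cell is not reproduced here. The sketch below shows the shape of such a simulation under illustrative assumptions: the coefficients, noise scales, and reduced covariate set are inventions for teaching, not the notebook's hidden parameters. The key structural choices match the brief: assignment depends on the same pre-treatment variables that drive the outcome, and part of the effect flows through the mediator.

```python
import numpy as np
import pandas as pd

# Hedged sketch of the teaching simulation; all coefficients are illustrative.
rng = np.random.default_rng(2026)
n = 20_000

# Pre-treatment confounders.
baseline_need = rng.normal(size=n)
prior_engagement = rng.normal(size=n)
friction_score = rng.normal(size=n)

# Confounded assignment: need and friction raise the guidance probability.
logit = 0.9 * baseline_need + 0.4 * friction_score + 0.3 * prior_engagement
true_propensity = 1.0 / (1.0 + np.exp(-logit))
proactive_guidance = rng.binomial(1, true_propensity)

# Post-treatment mediator, then the outcome. The total effect is the direct
# path plus the mediated path: 0.5 + 0.5 * 0.6 = 0.8.
activation_depth = 0.6 * proactive_guidance + 0.2 * baseline_need + rng.normal(scale=0.5, size=n)
future_engagement = (
    0.5 * proactive_guidance
    + 0.5 * activation_depth
    + 0.7 * baseline_need
    + 0.4 * prior_engagement
    + 0.2 * friction_score
    + rng.normal(scale=0.8, size=n)
)
true_ate = 0.5 + 0.5 * 0.6

sketch_df = pd.DataFrame({
    "baseline_need": baseline_need,
    "prior_engagement": prior_engagement,
    "friction_score": friction_score,
    "proactive_guidance": proactive_guidance,
    "activation_depth": activation_depth,
    "future_engagement": future_engagement,
})

# Because the confounders push treatment and outcome in the same direction,
# the raw treated-control gap overstates the true effect.
raw_difference_sketch = (
    sketch_df.loc[sketch_df["proactive_guidance"] == 1, "future_engagement"].mean()
    - sketch_df.loc[sketch_df["proactive_guidance"] == 0, "future_engagement"].mean()
)
print(f"true ATE {true_ate:.2f}, confounded raw difference {raw_difference_sketch:.2f}")
```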
Basic Shape And Quality Checks
Before causal modeling, inspect row count, columns, missingness, and treatment rate. These checks are not glamorous, but they prevent many avoidable mistakes.
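A minimal version of those checks, on a tiny stand-in table (the checks, not the data, are the point):

```python
import pandas as pd

# Toy stand-in for the analyst-facing table.
check_df = pd.DataFrame({
    "proactive_guidance": [1, 0, 0, 1, 0, 0, 1, 0],
    "future_engagement": [2.1, 1.0, 0.7, 2.5, 1.2, 0.9, 1.8, 1.1],
})

row_count = len(check_df)
missing_counts = check_df.isna().sum()
treatment_rate = check_df["proactive_guidance"].mean()
duplicate_rows = check_df.duplicated().sum()

# Fail fast on problems that would silently corrupt later estimates.
assert missing_counts.sum() == 0, "unexpected missing values"
assert 0.02 < treatment_rate < 0.98, "treatment rate too extreme for comparison"
print(f"{row_count} rows, treatment rate {treatment_rate:.2f}, {duplicate_rows} duplicate rows")
```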
There are no missing values, and the treatment rate is not extreme. That is a good start, but it does not mean treated and control units are comparable.
Outcome And Treatment Overview
A quick treated-versus-control summary shows the raw association. Because treatment is observational, this is not yet a causal estimate.
The raw difference is larger than the known true effect in this simulation. That is the expected sign of confounding: higher-need units are more likely to receive guidance and also have different outcome prospects.
Plot Raw Outcome Distributions
Distribution plots show more than the mean difference. Here we compare future engagement for treated and control units before any adjustment.
The distributions differ visibly, but this plot alone cannot tell us how much of that gap is causal. The next step is to inspect pre-treatment imbalance.
Covariate Balance Before Adjustment
For each pre-treatment covariate, we compute treated and control means plus the standardized mean difference. Large standardized differences indicate selection into treatment.
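The standardized mean difference (SMD) used in that table can be computed as below. The synthetic covariates here are stand-ins; note how a covariate that drives assignment shows a large SMD while an unrelated one stays near zero.

```python
import numpy as np

def standardized_mean_difference(covariate, treated_mask):
    """Difference in group means divided by the pooled standard deviation."""
    treated, control = covariate[treated_mask], covariate[~treated_mask]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(0)
n = 10_000
baseline_need = rng.normal(size=n)
unrelated_noise = rng.normal(size=n)  # a covariate that does not drive assignment

# Selection on need: higher-need units are more likely to be treated.
treated_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-1.2 * baseline_need))

smd_need = standardized_mean_difference(baseline_need, treated_mask)
smd_noise = standardized_mean_difference(unrelated_noise, treated_mask)
print(f"SMD baseline_need {smd_need:.3f}, SMD unrelated_noise {smd_noise:.3f}")
```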
Several covariates are imbalanced, especially baseline need, prior engagement, friction, and account tenure. This confirms that adjustment is necessary.
Plot Raw Covariate Balance
The dashed reference lines at +/-0.1 are common rough guides. Values outside that band suggest meaningful imbalance.
The plot makes the selection problem concrete. We need a graph and adjustment strategy before treating outcome differences as causal.
Estimate Propensity Scores For Diagnostics
Even when the final estimator is not propensity weighting, propensity scores are useful diagnostics. They show whether treated and control units have overlapping assignment probabilities.
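A sketch of the diagnostic propensity model on synthetic stand-in covariates. The notebook imports a scaler pipeline for this step; a plain logistic regression is enough to show the idea. An AUC near 0.5 would suggest near-random assignment, while a value well above 0.5 signals selection into treatment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 8_000
covariates = rng.normal(size=(n, 3))  # stand-ins for the pre-treatment columns
assignment_logit = 1.0 * covariates[:, 0] + 0.6 * covariates[:, 1] - 0.4 * covariates[:, 2]
treatment = rng.binomial(1, 1.0 / (1.0 + np.exp(-assignment_logit)))

# Diagnostic propensity model: predicted probability of treatment given X.
propensity_model = LogisticRegression(max_iter=1000).fit(covariates, treatment)
propensity = propensity_model.predict_proba(covariates)[:, 1]
auc = roc_auc_score(treatment, propensity)

print(f"propensity AUC {auc:.3f}, propensity range [{propensity.min():.3f}, {propensity.max():.3f}]")
```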
The propensity model has meaningful predictive power, which is another sign that treatment was not assigned randomly. The next question is whether there is enough overlap for comparison.
Plot Propensity Overlap
Overlap means treated and control units exist at similar propensity levels. Weak overlap makes causal estimates more model-dependent.
There is usable overlap across much of the propensity range, though the tails are thinner. That means weighting and matching are plausible but still need diagnostics.
Weight Diagnostics And Weighted Balance
For inverse-propensity weighting, extreme weights can dominate an estimate. We compute stabilized ATE weights, summarize them, and check whether weighting improves covariate balance.
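The weight construction and the effective-sample-size diagnostic can be sketched as follows; the clipping threshold and simulation parameters are illustrative assumptions.

```python
import numpy as np

def stabilized_ate_weights(treatment, propensity):
    """Stabilized IPW weights: P(T=t) / P(T=t | X). Mean weight is near 1."""
    share_treated = treatment.mean()
    return np.where(
        treatment == 1,
        share_treated / propensity,
        (1.0 - share_treated) / (1.0 - propensity),
    )

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return weights.sum() ** 2 / np.square(weights).sum()

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
# Clip propensities away from 0 and 1 so no single unit dominates.
propensity = np.clip(1.0 / (1.0 + np.exp(-0.8 * x)), 0.01, 0.99)
treatment = rng.binomial(1, propensity)

weights = stabilized_ate_weights(treatment, propensity)
ess = effective_sample_size(weights)
print(f"mean weight {weights.mean():.3f}, max weight {weights.max():.2f}, ESS {ess:.0f} / {n}")
```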
The weights are not wildly extreme, and the effective sample size remains healthy. That supports using propensity weighting as one estimator and one diagnostic lens.
Balance Before And After Weighting
A good propensity model should reduce pre-treatment imbalance after weighting. This cell compares raw and weighted standardized mean differences.
Weighting substantially reduces the largest imbalances. This does not prove no unmeasured confounding, but it shows the observed covariates are more comparable after weighting.
Plot Balance Improvement
The paired plot below shows each covariate before and after weighting. The goal is to move points closer to zero.
The weighted points cluster near zero. This makes the weighted estimator more credible than the raw treated-control contrast, at least for observed covariates.
Specify The Causal Graph
The graph includes pre-treatment confounders, treatment, the post-treatment mediator, and the outcome. The mediator is included in the graph so the timing is explicit, but it is not included in the total-effect adjustment set.
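The graph-encoding cell is not shown, so here is one plausible encoding as a networkx DiGraph, under the assumption that every pre-treatment covariate points at both treatment and outcome (the notebook's exact edge list, e.g. whether confounders also point at the mediator, may differ). A graph in this form (or its DOT/GML serialization, depending on DoWhy version) is what CausalModel consumes.

```python
import networkx as nx

PRE_TREATMENT = [
    "baseline_need", "prior_engagement", "account_tenure_z",
    "friction_score", "region_risk", "high_need_segment",
]

causal_graph = nx.DiGraph()
for covariate in PRE_TREATMENT:
    causal_graph.add_edge(covariate, "proactive_guidance")   # confounder -> treatment
    causal_graph.add_edge(covariate, "future_engagement")    # confounder -> outcome

# Total-effect structure: a direct path plus a mediated path through activation depth.
causal_graph.add_edge("proactive_guidance", "activation_depth")
causal_graph.add_edge("activation_depth", "future_engagement")
causal_graph.add_edge("proactive_guidance", "future_engagement")

assert nx.is_directed_acyclic_graph(causal_graph)
print(f"{causal_graph.number_of_nodes()} nodes, {causal_graph.number_of_edges()} edges")
```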
The graph states the total-effect logic: guidance can affect future engagement directly and indirectly through activation depth. Pre-treatment variables can affect treatment and outcome, so they need adjustment.
Visualize The DAG
The diagram puts pre-treatment covariates on the left, treatment and mediator in the middle, and the outcome on the right. This visual structure helps prevent bad-control mistakes.
CausalModel created for proactive guidance -> future engagement.
The model now knows the graph and variable roles. The next step is identification: finding a valid adjustment formula under the graph assumptions.
Identify The Total Effect
DoWhy identifies the estimand implied by the graph. For this total-effect question, the valid backdoor adjustment set should include pre-treatment covariates and exclude activation depth.
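DoWhy performs this search automatically via identify_effect. As a minimal illustration of why the mediator can never enter the backdoor set, the toy check below uses a reduced DAG with a single confounder: when all confounders are observed, the parents of treatment form a valid backdoor adjustment set, and no descendant of treatment may be included.

```python
import networkx as nx

# Minimal DAG: one confounder, the treatment, the mediator, and the outcome.
dag = nx.DiGraph([
    ("baseline_need", "proactive_guidance"),
    ("baseline_need", "future_engagement"),
    ("proactive_guidance", "activation_depth"),
    ("proactive_guidance", "future_engagement"),
    ("activation_depth", "future_engagement"),
])

# Parents of treatment are a valid backdoor set when all confounders are observed.
backdoor_set = set(dag.predecessors("proactive_guidance"))
forbidden = nx.descendants(dag, "proactive_guidance")

assert "activation_depth" in forbidden
assert backdoor_set.isdisjoint(forbidden)
print(f"adjust for {sorted(backdoor_set)}; never adjust for descendants {sorted(forbidden)}")
```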
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d
─────────────────────(E[future_engagement|account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need])
d[proactive_guidance]
Estimand assumption 1, Unconfoundedness: If U→{proactive_guidance} and U→future_engagement then P(future_engagement|proactive_guidance,account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need,U) = P(future_engagement|proactive_guidance,account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
         d
─────────────────────(E[future_engagement|account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need])
d[proactive_guidance]
Estimand assumption 1, Unconfoundedness: If U→{proactive_guidance} and U→future_engagement then P(future_engagement|proactive_guidance,account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need,U) = P(future_engagement|proactive_guidance,account_tenure_z,friction_score,prior_engagement,high_need_segment,region_risk,baseline_need)
The printed backdoor estimand lists pre-treatment variables as adjustment variables. Activation depth does not appear in the adjustment set because it is downstream of treatment.
Manual Baseline Estimates
Before using DoWhy estimators, compute three simple manual comparisons: raw difference, adjusted regression, and stabilized IPW. This makes the DoWhy estimates easier to audit.
# Raw treated-control difference.
manual_raw_difference = raw_difference

# Adjusted regression for the total effect: do not include the mediator.
adjusted_design = sm.add_constant(analyst_df[[TREATMENT] + PRE_TREATMENT_COVARIATES])
adjusted_regression = sm.OLS(analyst_df[OUTCOME], adjusted_design).fit()
manual_adjusted_regression = adjusted_regression.params[TREATMENT]

# Stabilized IPW estimate.
treated_weighted_mean = np.average(
    analyst_df.loc[analyst_df[TREATMENT] == 1, OUTCOME],
    weights=analyst_df.loc[analyst_df[TREATMENT] == 1, "stabilized_weight"],
)
control_weighted_mean = np.average(
    analyst_df.loc[analyst_df[TREATMENT] == 0, OUTCOME],
    weights=analyst_df.loc[analyst_df[TREATMENT] == 0, "stabilized_weight"],
)
manual_ipw = treated_weighted_mean - control_weighted_mean

manual_estimates = pd.DataFrame([
    {"estimator": "raw difference", "estimate": manual_raw_difference, "notes": "confounded association"},
    {"estimator": "adjusted OLS", "estimate": manual_adjusted_regression, "notes": "adjusts for pre-treatment covariates"},
    {"estimator": "stabilized IPW", "estimate": manual_ipw, "notes": "weights by estimated propensity"},
])
manual_estimates["known_true_ate"] = true_ate
manual_estimates["absolute_error"] = (manual_estimates["estimate"] - true_ate).abs()
manual_estimates.to_csv(TABLE_DIR / "14_manual_estimates.csv", index=False)
display(manual_estimates)
estimator | estimate | notes | known_true_ate | absolute_error
raw difference | 1.3247 | confounded association | 0.8496 | 0.4751
adjusted OLS | 0.8352 | adjusts for pre-treatment covariates | 0.8496 | 0.0144
stabilized IPW | 0.8467 | weights by estimated propensity | 0.8496 | 0.0029
The adjusted and weighted estimates move closer to the known effect than the raw difference. That is the pattern we want in this teaching case.
DoWhy Estimator Comparison
Now we estimate the same identified estimand with several DoWhy estimators. Agreement across reasonable estimators is not proof, but it improves confidence that the result is not an artifact of one method.
The DoWhy estimates cluster near the known true effect. In real data, the known truth is unavailable, so the clustering itself would be one part of the credibility story.
Plot Estimator Comparison
The plot compares raw, manual adjusted, and DoWhy estimates against the known truth. The truth line is available only because this is a teaching simulation.
The adjusted estimates are much closer to the known effect than the raw difference. That is a clean demonstration of why the causal graph and adjustment set matter.
Bad-Control Demonstration
Activation depth is post-treatment. If we include it in a total-effect regression, we block the mediated pathway and estimate something closer to a direct effect. This cell shows the size of that mistake.
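The mechanics of the mistake can be shown in a few lines. The example below uses a randomized treatment to isolate the bad-control problem from confounding; the coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
# Randomized treatment here, so any gap between the two models is pure bad control.
t = rng.binomial(1, 0.5, size=n).astype(float)
mediator = 0.9 * t + rng.normal(scale=0.5, size=n)
y = 0.4 * t + 0.5 * mediator + rng.normal(size=n)  # true total effect = 0.4 + 0.5 * 0.9 = 0.85

def treatment_coefficient(columns, outcome):
    """OLS coefficient on the first column after the intercept."""
    design = np.column_stack([np.ones(len(outcome))] + columns)
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]

total_effect = treatment_coefficient([t], y)                  # correct total-effect model
bad_control_effect = treatment_coefficient([t, mediator], y)  # mediator wrongly held fixed
print(f"total effect ~{total_effect:.2f}; with bad control ~{bad_control_effect:.2f}")
```

Holding the mediator fixed recovers roughly the direct coefficient (0.4 here) rather than the total effect (0.85): a different estimand, reached by accident.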
The mediator-adjusted coefficient is smaller because part of the treatment effect flows through activation depth. That is not a bug; it is a different estimand.
Plot The Bad-Control Effect
This chart emphasizes how much the treatment coefficient changes when a post-treatment mediator is added as a control.
The bad-control model would understate the total effect. This is a common applied mistake, and the DAG is the best defense against it.
Segment Heterogeneity
The average effect is useful, but the simulated treatment effect is larger for high-need units. We inspect segment-level effects using the stabilized weights and compare them to the known segment truth.
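A minimal sketch of the segment computation. The segment effect sizes (1.2 versus 0.5) and the randomized assignment with unit weights are illustrative assumptions; in the notebook the weights column would hold the stabilized IPW weights.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 20_000
high_need = rng.binomial(1, 0.5, size=n)
treated = rng.binomial(1, 0.5, size=n)
# Illustrative heterogeneity: effect 1.2 in the high-need segment, 0.5 otherwise.
y = np.where(high_need == 1, 1.2, 0.5) * treated + rng.normal(size=n)
df = pd.DataFrame({"high_need_segment": high_need, "treated": treated, "y": y, "weight": 1.0})

def weighted_effect(segment):
    """Weighted treated-minus-control mean within one segment."""
    on = segment[segment["treated"] == 1]
    off = segment[segment["treated"] == 0]
    return (np.average(on["y"], weights=on["weight"])
            - np.average(off["y"], weights=off["weight"]))

segment_effects = {
    name: weighted_effect(group) for name, group in df.groupby("high_need_segment")
}
print({k: round(v, 3) for k, v in segment_effects.items()})
```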
The high-need segment has the larger effect, which matches the simulated truth. In real observational data, this kind of subgroup result should be treated as exploratory unless pre-specified or validated.
Plot Segment Effects
The plot compares the segment estimates to their known teaching truth. The gap between segments is often more decision-relevant than the overall average.
The segment pattern suggests guidance is more valuable for higher-need units. That is a targeting hypothesis, not a deployment decision by itself.
Negative-Control Outcome Check
A negative-control outcome is something the treatment should not causally affect. Here pre_period_satisfaction is measured before treatment, so guidance cannot cause it. If an adjusted model still finds a large treatment effect on this pre-period outcome, that would suggest residual confounding.
negative_control_naive = sm.OLS(
    analyst_df[NEGATIVE_CONTROL_OUTCOME],
    sm.add_constant(analyst_df[[TREATMENT]]),
).fit()
negative_control_adjusted = sm.OLS(
    analyst_df[NEGATIVE_CONTROL_OUTCOME],
    sm.add_constant(analyst_df[[TREATMENT] + PRE_TREATMENT_COVARIATES]),
).fit()

negative_control_table = pd.DataFrame([
    {"model": "negative control: naive",
     "treatment_coefficient": negative_control_naive.params[TREATMENT],
     "standard_error": negative_control_naive.bse[TREATMENT],
     "reading": "raw association before adjustment"},
    {"model": "negative control: adjusted",
     "treatment_coefficient": negative_control_adjusted.params[TREATMENT],
     "standard_error": negative_control_adjusted.bse[TREATMENT],
     "reading": "should be close to zero if observed confounding is handled"},
])
negative_control_table.to_csv(TABLE_DIR / "14_negative_control_outcome.csv", index=False)
display(negative_control_table)
model | treatment_coefficient | standard_error | reading
negative control: naive | 0.4673 | 0.0235 | raw association before adjustment
negative control: adjusted | 0.0227 | 0.0235 | should be close to zero if observed confounding is handled
The adjusted negative-control coefficient is much closer to zero than the naive coefficient. This supports, but does not prove, that the adjustment set handles the observed confounding.
Run DoWhy Refuters
Refuters check whether the estimate behaves sensibly under deliberate perturbations. We use the linear regression estimate as the baseline because it is fast, transparent, and close to the known truth in this simulation.
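DoWhy wraps these checks in refute_estimate (the notebook's call cell is not shown above). To make the logic of the placebo check transparent, here is a hand-rolled analogue on synthetic data: permuting the treatment column breaks any causal link, so the same adjusted model should report an effect near zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x))).astype(float)
y = 0.8 * t + 0.7 * x + rng.normal(size=n)

def adjusted_effect(treatment, covariate, outcome):
    """Coefficient on treatment in an OLS that adjusts for the covariate."""
    design = np.column_stack([np.ones(len(outcome)), treatment, covariate])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]

original_effect = adjusted_effect(t, x, y)
# Placebo: a permuted treatment should produce an effect near zero.
placebo_effect = adjusted_effect(rng.permutation(t), x, y)
print(f"original {original_effect:.3f}, placebo {placebo_effect:.3f}")
```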
Refute: Add a random common cause
Estimated effect: 0.8352181563708903
New effect: 0.8352136116020734
p value: 0.47544790820359595

Refute: Use a Placebo Treatment
Estimated effect: 0.8352181563708903
New effect: 0.009719829616254694
p value: 0.3728059681347832

Refute: Use a subset of data
Estimated effect: 0.8352181563708903
New effect: 0.8292451720091083
p value: 0.21892525767676085
A stable estimate should not move much when a random covariate is added. A placebo treatment should produce an effect near zero. A subset refuter should produce a similar estimate on repeated subsets.
Summarize Refuter Results
The raw refuter printouts are useful but verbose. This table extracts the main quantities for reporting.
refutation_summary = pd.DataFrame([
    {"check": "random common cause",
     "original_effect": baseline_dowhy_estimate.value,
     "new_effect": random_common_cause_refutation.new_effect,
     "desired_behavior": "close to original estimate"},
    {"check": "placebo treatment",
     "original_effect": baseline_dowhy_estimate.value,
     "new_effect": placebo_treatment_refutation.new_effect,
     "desired_behavior": "near zero"},
    {"check": "data subset",
     "original_effect": baseline_dowhy_estimate.value,
     "new_effect": data_subset_refutation.new_effect,
     "desired_behavior": "close to original estimate"},
])
refutation_summary["absolute_change_from_original"] = (
    refutation_summary["new_effect"] - refutation_summary["original_effect"]
).abs()
refutation_summary.to_csv(TABLE_DIR / "14_refutation_summary.csv", index=False)
display(refutation_summary)
check | original_effect | new_effect | desired_behavior | absolute_change_from_original
random common cause | 0.8352 | 0.8352 | close to original estimate | 0.0000
placebo treatment | 0.8352 | 0.0097 | near zero | 0.8255
data subset | 0.8352 | 0.8292 | close to original estimate | 0.0060
The placebo check is the most important falsification check here: a fake treatment should not reproduce the original effect. The random-cause and subset checks assess stability.
Plot Refuter Results
The plot compares each refuter’s new effect to the original baseline estimate and to zero.
The refuter results behave as expected in this controlled case. That raises confidence in the workflow, while still leaving the usual observational caveats in place.
Hidden-Confounding Sensitivity
The main untestable assumption is no unmeasured confounding. DoWhy’s unobserved-common-cause refuter simulates how the estimate could move under hypothetical hidden confounders of different strengths.
treatment_strength_grid = np.array([0.01, 0.03, 0.05])
outcome_strength_grid = np.array([0.05, 0.15, 0.30])

hidden_confounder_refutation = case_model.refute_estimate(
    identified_estimand,
    baseline_dowhy_estimate,
    method_name="add_unobserved_common_cause",
    simulation_method="direct-simulation",
    confounders_effect_on_treatment="binary_flip",
    confounders_effect_on_outcome="linear",
    effect_strength_on_treatment=treatment_strength_grid,
    effect_strength_on_outcome=outcome_strength_grid,
    plotmethod=None,
)

hidden_confounding_matrix = pd.DataFrame(
    hidden_confounder_refutation.new_effect_array,
    index=[f"treatment strength {value:.2f}" for value in treatment_strength_grid],
    columns=[f"outcome strength {value:.2f}" for value in outcome_strength_grid],
)
hidden_confounding_matrix.to_csv(TABLE_DIR / "14_hidden_confounding_sensitivity_matrix.csv")
display(hidden_confounding_matrix)

print(f"Original estimate: {baseline_dowhy_estimate.value:.3f}")
print(f"Range after simulated hidden confounding: {hidden_confounder_refutation.new_effect}")
 | outcome strength 0.05 | outcome strength 0.15 | outcome strength 0.30
treatment strength 0.01 | 0.8056 | 0.7739 | 0.7520
treatment strength 0.03 | 0.6843 | 0.6413 | 0.5860
treatment strength 0.05 | 0.5301 | 0.4762 | 0.4336
Original estimate: 0.835
Range after simulated hidden confounding: (np.float64(0.43357046044109304), np.float64(0.8055849488606437))
The sensitivity table shows that stronger hidden confounding can reduce the estimate. This does not mean such confounding exists, but it clarifies what kind of unobserved bias would threaten the conclusion.
Final Evidence Scorecard
A useful causal report combines estimates with diagnostics. The scorecard below summarizes the main evidence in plain language.
main_dowhy_estimate = baseline_dowhy_estimate.value
best_propensity_estimate = dowhy_estimates.loc[
    dowhy_estimates["estimator"] == "propensity_score_weighting", "estimate"
].iloc[0]
placebo_effect = placebo_treatment_refutation.new_effect
adjusted_negative_control = negative_control_table.loc[
    negative_control_table["model"] == "negative control: adjusted",
    "treatment_coefficient",
].iloc[0]
max_weighted_smd = weighted_balance["absolute_weighted_smd"].max()

scorecard = pd.DataFrame([
    {"evidence item": "main adjusted DoWhy estimate",
     "result": f"{main_dowhy_estimate:.3f}",
     "reading": "positive estimated total effect after graph-based adjustment"},
    {"evidence item": "estimator agreement",
     "result": f"weighting estimate {best_propensity_estimate:.3f}",
     "reading": "propensity and regression estimates tell a similar story"},
    {"evidence item": "overlap and balance",
     "result": f"max weighted SMD {max_weighted_smd:.3f}",
     "reading": "observed covariates are much better balanced after weighting"},
    {"evidence item": "placebo treatment refuter",
     "result": f"{placebo_effect:.3f}",
     "reading": "fake treatment does not reproduce the main effect"},
    {"evidence item": "negative-control outcome",
     "result": f"adjusted coefficient {adjusted_negative_control:.3f}",
     "reading": "pre-period outcome association mostly disappears after adjustment"},
    {"evidence item": "hidden-confounding sensitivity",
     "result": f"range {hidden_confounder_refutation.new_effect[0]:.3f} to {hidden_confounder_refutation.new_effect[1]:.3f}",
     "reading": "strong unobserved confounding could still reduce the estimate"},
])
scorecard.to_csv(TABLE_DIR / "14_evidence_scorecard.csv", index=False)
display(scorecard)
evidence item | result | reading
main adjusted DoWhy estimate | 0.835 | positive estimated total effect after graph-based adjustment
estimator agreement | weighting estimate 0.854 | propensity and regression estimates tell a similar story
overlap and balance | max weighted SMD 0.018 | observed covariates are much better balanced after weighting
placebo treatment refuter | 0.010 | fake treatment does not reproduce the main effect
negative-control outcome | adjusted coefficient 0.023 | pre-period outcome association mostly disappears after adjustment
hidden-confounding sensitivity | range 0.434 to 0.806 | strong unobserved confounding could still reduce the estimate
The scorecard supports a positive effect but keeps the observational limitation visible. This is the tone an end-to-end causal report should aim for.
Report-Ready Final Summary
The final table is written in the style of a compact applied report. It names the estimate, diagnostics, limitations, and recommended follow-up.
final_summary = pd.DataFrame([
    {"section": "Causal question",
     "summary": "Estimate the average total effect of proactive guidance on future engagement."},
    {"section": "Primary estimate",
     "summary": f"The main DoWhy adjusted estimate is {main_dowhy_estimate:.3f}; the known teaching ATE is {true_ate:.3f}."},
    {"section": "Estimator robustness",
     "summary": "Regression, matching, stratification, and weighting estimates are directionally consistent."},
    {"section": "Diagnostics",
     "summary": "Observed covariate imbalance is substantial before adjustment and much smaller after weighting."},
    {"section": "Falsification checks",
     "summary": "Placebo and negative-control checks behave as expected in the teaching data."},
    {"section": "Important caveat",
     "summary": "The result still depends on the no-unmeasured-confounding assumption."},
    {"section": "Recommended next step",
     "summary": "Use this analysis to justify a prospective experiment or a more controlled rollout design."},
])
final_summary.to_csv(TABLE_DIR / "14_final_case_study_summary.csv", index=False)
display(final_summary)
section | summary
Causal question | Estimate the average total effect of proactive guidance on future engagement.
Primary estimate | The main DoWhy adjusted estimate is 0.835; the known teaching ATE is 0.850.
Estimator robustness | Regression, matching, stratification, and weighting estimates are directionally consistent.
Diagnostics | Observed covariate imbalance is substantial before adjustment and much smaller after weighting.
Falsification checks | Placebo and negative-control checks behave as expected in the teaching data.
Important caveat | The result still depends on the no-unmeasured-confounding assumption.
Recommended next step | Use this analysis to justify a prospective experiment or a more controlled rollout design.
This summary is intentionally measured. Observational analysis can guide decisions and experiment design, but it should not be reported as if treatment were randomized.
Final Checklist
The last checklist can be reused in other applied DoWhy analyses. It is a compact version of the workflow from this notebook.
reusable_checklist = pd.DataFrame([
    {"check": "Question defined before modeling", "status in this notebook": "done"},
    {"check": "Variable timing documented", "status in this notebook": "done"},
    {"check": "Pre-treatment confounding inspected", "status in this notebook": "done"},
    {"check": "Overlap and balance checked", "status in this notebook": "done"},
    {"check": "DAG drawn and encoded", "status in this notebook": "done"},
    {"check": "DoWhy estimand identified", "status in this notebook": "done"},
    {"check": "Multiple estimators compared", "status in this notebook": "done"},
    {"check": "Post-treatment bad control avoided", "status in this notebook": "done"},
    {"check": "Refuters and negative controls run", "status in this notebook": "done"},
    {"check": "Limitations reported with estimate", "status in this notebook": "done"},
])
reusable_checklist.to_csv(TABLE_DIR / "14_reusable_case_study_checklist.csv", index=False)
display(reusable_checklist)
check | status in this notebook
Question defined before modeling | done
Variable timing documented | done
Pre-treatment confounding inspected | done
Overlap and balance checked | done
DAG drawn and encoded | done
DoWhy estimand identified | done
Multiple estimators compared | done
Post-treatment bad control avoided | done
Refuters and negative controls run | done
Limitations reported with estimate | done
The checklist is the main reusable artifact from this capstone. The exact estimator may change across applications, but the structure of disciplined causal analysis should remain stable.
Closing Notes
This end-to-end notebook pulled together the major ideas from the earlier DoWhy tutorials:
graph-based adjustment;
propensity overlap and balance;
regression and propensity estimators;
bad-control reasoning;
heterogeneous effects;
refuters and negative controls;
hidden-confounding sensitivity;
report-style summarization.
The next tutorial closes the DoWhy series with common pitfalls, debugging patterns, and reporting habits that make causal notebooks easier to trust.