DoWhy Tutorial 04: Regression, Matching, And Propensity Estimators
This notebook compares several estimators for the same identified causal estimand. The goal is not to crown one estimator as universally best. The goal is to understand what each estimator is doing, what assumptions it leans on, and why estimates can differ even when the causal graph is unchanged.
We will use one observational teaching dataset, one causal graph, and one target estimand: the average total effect of feature_exposure on weekly_value. Then we will estimate that target with outcome regression, propensity-score matching, propensity-score stratification, and propensity-score weighting.
Learning Goals
By the end of this notebook, you should be able to:
Explain the difference between an estimand and an estimator.
Run manual versions of regression, matching, stratification, and weighting estimators.
Run the corresponding DoWhy estimators for a common backdoor estimand.
Diagnose basic estimator behavior using propensity overlap, covariate balance, and estimate stability.
Explain why estimator agreement is useful but does not prove the causal graph is correct.
Estimand Versus Estimator
The estimand is the causal quantity implied by the graph and assumptions. In this notebook, the estimand is the average treatment effect after adjusting for observed common causes.
The estimator is the statistical method used to compute that estimand from finite data. Regression, matching, stratification, and weighting can all target the same estimand, but they do so with different statistical machinery.
Setup
This setup cell imports the packages used in the notebook, creates output folders, fixes a random seed, and suppresses known third-party compatibility warnings. The warning policy keeps expected library chatter out of the student-facing notebook while preserving real execution errors.
from pathlib import Path
import os
import platform
import sys
import warnings

START_DIR = Path.cwd().resolve()
PROJECT_ROOT = next(
    (candidate for candidate in [START_DIR, *START_DIR.parents] if (candidate / "pyproject.toml").exists()),
    START_DIR,
)
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "dowhy"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
CACHE_DIR = PROJECT_ROOT / ".cache" / "matplotlib"
for directory in [OUTPUT_DIR, FIGURE_DIR, TABLE_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("MPLCONFIGDIR", str(CACHE_DIR))

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*disp.*iprint.*L-BFGS-B.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="seaborn.categorical")
warnings.filterwarnings("ignore", module="pydot.dot_parser")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import dowhy
from dowhy import CausalModel

RANDOM_SEED = 41
rng = np.random.default_rng(RANDOM_SEED)
sns.set_theme(style="whitegrid", context="notebook")

print(f"Python executable: {sys.executable}")
print(f"Python version: {platform.python_version()}")
print(f"DoWhy version: {getattr(dowhy, '__version__', 'unknown')}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
The notebook is ready if this cell prints a DoWhy version. All generated artifacts from this notebook use a 04_ prefix.
Estimator Map
Before creating data, it helps to name the estimator families. This table gives a student-friendly summary of what each method is trying to do.
estimator_map = pd.DataFrame(
    [
        {
            "estimator_family": "Outcome regression",
            "basic_idea": "Model the outcome as a function of treatment and confounders.",
            "main_strength": "Simple and interpretable when the outcome model is credible.",
            "main_failure_mode": "Can be biased if the outcome model is misspecified.",
        },
        {
            "estimator_family": "Propensity-score matching",
            "basic_idea": "Compare treated and untreated units with similar treatment probabilities.",
            "main_strength": "Makes treated-control comparability intuitive.",
            "main_failure_mode": "Can be noisy or biased when matches are poor or overlap is weak.",
        },
        {
            "estimator_family": "Propensity-score stratification",
            "basic_idea": "Compare treated and untreated units within propensity-score strata.",
            "main_strength": "Easy to inspect and explain.",
            "main_failure_mode": "Sensitive to bin choices and sparse cells.",
        },
        {
            "estimator_family": "Propensity-score weighting",
            "basic_idea": "Reweight units by inverse treatment probability to reduce observed confounding.",
            "main_strength": "Targets population-level effects directly when propensities are reliable.",
            "main_failure_mode": "Extreme weights can create high variance and fragile estimates.",
        },
    ]
)
estimator_map.to_csv(TABLE_DIR / "04_estimator_map.csv", index=False)
estimator_map
|   | estimator_family | basic_idea | main_strength | main_failure_mode |
|---|------------------|------------|---------------|-------------------|
| 0 | Outcome regression | Model the outcome as a function of treatment and confounders. | Simple and interpretable when the outcome model is credible. | Can be biased if the outcome model is misspecified. |
| 1 | Propensity-score matching | Compare treated and untreated units with similar treatment probabilities. | Makes treated-control comparability intuitive. | Can be noisy or biased when matches are poor or overlap is weak. |
| 2 | Propensity-score stratification | Compare treated and untreated units within propensity-score strata. | Easy to inspect and explain. | Sensitive to bin choices and sparse cells. |
| 3 | Propensity-score weighting | Reweight units by inverse treatment probability to reduce observed confounding. | Targets population-level effects directly when propensities are reliable. | Extreme weights can create high variance and fragile estimates. |
All four methods still rely on the same causal assumption: after adjusting for observed common causes, treatment is as-if randomized. The estimator does not rescue a wrong graph.
Causal Question And Variable Roles
The teaching question is:
What is the average total effect of feature_exposure on weekly_value?
Apart from the treatment, the outcome, and the simulation-only treatment probability, every variable below is a pre-treatment common cause.
role_table = pd.DataFrame(
    [
        {"variable": "feature_exposure", "role": "treatment", "timing": "treatment time", "used_for_adjustment": "no; this is the treatment"},
        {"variable": "weekly_value", "role": "outcome", "timing": "future outcome window", "used_for_adjustment": "no; this is the outcome"},
        {"variable": "user_engagement", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "prior_sessions", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "account_age_weeks", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "is_power_user", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "baseline_value", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "treatment_probability", "role": "simulation diagnostic", "timing": "known only because this is simulated", "used_for_adjustment": "no; not a real observed column"},
    ]
)
role_table.to_csv(TABLE_DIR / "04_variable_roles.csv", index=False)
role_table
|   | variable | role | timing | used_for_adjustment |
|---|----------|------|--------|---------------------|
| 0 | feature_exposure | treatment | treatment time | no; this is the treatment |
| 1 | weekly_value | outcome | future outcome window | no; this is the outcome |
| 2 | user_engagement | observed common cause | pre-treatment | yes |
| 3 | prior_sessions | observed common cause | pre-treatment | yes |
| 4 | account_age_weeks | observed common cause | pre-treatment | yes |
| 5 | is_power_user | observed common cause | pre-treatment | yes |
| 6 | baseline_value | observed common cause | pre-treatment | yes |
| 7 | treatment_probability | simulation diagnostic | known only because this is simulated | no; not a real observed column |
The adjustment set is deliberately straightforward: all adjustment variables are pre-treatment common causes. This lets the notebook focus on estimator mechanics rather than bad-control mistakes.
Create A Teaching Dataset
This dataset is observational: treatment assignment depends on baseline variables. The true effect is constant and known because this is a simulation. That gives us a benchmark for comparing estimators.
The first rows show the treatment, outcome, common causes, and known assignment probability. In real data, the true treatment probability would usually be unknown, so we will also estimate a propensity model.
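The notebook's actual generating cell is not reproduced here, but a minimal sketch of this kind of observational data-generating process can clarify the structure. All coefficients below are illustrative assumptions, not the notebook's true values; only the constant effect `TRUE_ATE = 1.65` is taken from the notebook's own summary tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(41)
n = 5_000
TRUE_ATE = 1.65  # known only because we simulate

# Pre-treatment common causes (distributions are illustrative).
user_engagement = rng.normal(0.0, 1.0, n)
prior_sessions = rng.poisson(5, n).astype(float)
account_age_weeks = rng.integers(1, 200, n).astype(float)
is_power_user = rng.binomial(1, 0.2, n).astype(float)
baseline_value = 10 + 2 * user_engagement + 0.5 * is_power_user + rng.normal(0, 1, n)

# Treatment depends on baseline covariates -> observed confounding.
logit = -0.5 + 0.8 * user_engagement + 0.6 * is_power_user + 0.05 * (baseline_value - 10)
treatment_probability = 1 / (1 + np.exp(-logit))
feature_exposure = rng.binomial(1, treatment_probability)

# Outcome: a constant treatment effect plus confounder effects and noise.
weekly_value = (
    baseline_value
    + TRUE_ATE * feature_exposure
    + 1.5 * user_engagement
    + 0.3 * prior_sessions
    + rng.normal(0, 1.5, n)
)

estimator_df = pd.DataFrame({
    "feature_exposure": feature_exposure,
    "weekly_value": weekly_value,
    "user_engagement": user_engagement,
    "prior_sessions": prior_sessions,
    "account_age_weeks": account_age_weeks,
    "is_power_user": is_power_user,
    "baseline_value": baseline_value,
    "treatment_probability": treatment_probability,
})
```

The key design choice is that `feature_exposure` depends on the same baseline variables that drive `weekly_value`, which is exactly what makes the naive difference in means biased.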
Basic Dataset Checks
Before fitting estimators, inspect treatment prevalence, outcome scale, covariate ranges, and the treatment-probability range.
The treatment rate is close enough to balanced for all four estimator families to be meaningful. The treatment-probability range shows some tails, which is useful for teaching weighting and matching behavior.
Confirm Observed Confounding
This plot shows that treatment assignment is not random. Exposed users tend to have different baseline engagement and baseline value.
The treated and untreated groups differ before treatment. That is why all estimators in this notebook target an adjusted effect rather than a raw difference in means.
Baseline Imbalance Table
Standardized mean differences summarize baseline imbalance in standard-deviation units. We will use the same metric later to compare raw and weighted balance.
Large absolute values mean exposed and unexposed users differ on that covariate. Matching, stratification, and weighting are all different attempts to reduce the consequences of this imbalance.
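The standardized mean difference described above divides the treated-minus-control mean gap by the pooled standard deviation. A minimal sketch (the function name is ours, not the notebook's):

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, covariate, treatment_col="feature_exposure"):
    """SMD: difference in group means scaled by the pooled standard deviation."""
    treated = df.loc[df[treatment_col] == 1, covariate]
    control = df.loc[df[treatment_col] == 0, covariate]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Tiny illustration with an obviously imbalanced covariate.
demo = pd.DataFrame({
    "feature_exposure": [1, 1, 1, 0, 0, 0],
    "user_engagement": [2.0, 2.5, 3.0, 0.0, 0.5, 1.0],
})
smd = standardized_mean_difference(demo, "user_engagement")  # -> 4.0
```

Because the result is in standard-deviation units, the same threshold (commonly around 0.1) can be applied across covariates with very different scales.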
Naive Difference And Outcome Regression
We start with two simple baselines: the naive treated-minus-control difference and an adjusted regression. The adjusted regression controls for the observed common causes named in the graph.
naive_effect = (
    estimator_df.loc[estimator_df["feature_exposure"] == 1, "weekly_value"].mean()
    - estimator_df.loc[estimator_df["feature_exposure"] == 0, "weekly_value"].mean()
)
regression_formula = (
    "weekly_value ~ feature_exposure + user_engagement + prior_sessions "
    "+ account_age_weeks + is_power_user + baseline_value"
)
regression_fit = smf.ols(formula=regression_formula, data=estimator_df).fit()
regression_effect = regression_fit.params["feature_exposure"]
regression_ci = regression_fit.conf_int().loc["feature_exposure"].to_numpy()
regression_baseline = pd.DataFrame(
    [
        {
            "estimator": "known_true_ate",
            "estimate": TRUE_ATE,
            "ci_95_lower": np.nan,
            "ci_95_upper": np.nan,
            "description": "Known only because this is a teaching simulation.",
        },
        {
            "estimator": "naive_difference_in_means",
            "estimate": naive_effect,
            "ci_95_lower": np.nan,
            "ci_95_upper": np.nan,
            "description": "Unadjusted association; confounding is not blocked.",
        },
        {
            "estimator": "adjusted_outcome_regression",
            "estimate": regression_effect,
            "ci_95_lower": regression_ci[0],
            "ci_95_upper": regression_ci[1],
            "description": "Backdoor adjustment through a linear outcome model.",
        },
    ]
)
regression_baseline.to_csv(TABLE_DIR / "04_regression_baseline_estimates.csv", index=False)
regression_baseline
|   | estimator | estimate | ci_95_lower | ci_95_upper | description |
|---|-----------|----------|-------------|-------------|-------------|
| 0 | known_true_ate | 1.650000 | NaN | NaN | Known only because this is a teaching simulation. |
| 1 | naive_difference_in_means | 3.442531 | NaN | NaN | Unadjusted association; confounding is not blocked. |
| 2 | adjusted_outcome_regression | 1.675725 | 1.600592 | 1.750857 | Backdoor adjustment through a linear outcome model. |
The naive difference is inflated because exposed users are stronger at baseline. The adjusted regression moves much closer to the known ATE because it blocks the observed backdoor paths.
Estimate Propensity Scores
Propensity-score methods first model the probability of treatment given observed common causes. The propensity score compresses multiple confounders into one balancing score.
This cell estimates propensities with logistic regression. We clip very small or very large estimated probabilities to keep manual weighting stable for teaching.
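The estimation step described above can be sketched with scikit-learn. The data below is synthetic stand-in data and the clipping bounds of 0.01 and 0.99 are an illustrative choice, not the notebook's exact values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(41)
n = 2_000
X = pd.DataFrame({
    "user_engagement": rng.normal(0, 1, n),
    "baseline_value": rng.normal(10, 2, n),
})
logit = -0.5 + 0.8 * X["user_engagement"] + 0.1 * (X["baseline_value"] - 10)
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Standardize covariates, then fit a logistic propensity model.
propensity_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
propensity_model.fit(X, t)
propensity = propensity_model.predict_proba(X)[:, 1]

# Clip extreme propensities to keep downstream weights bounded.
propensity = np.clip(propensity, 0.01, 0.99)
auc = roc_auc_score(t, propensity)
```

The AUC here is a diagnostic of how strongly covariates predict treatment, not a model-quality score to maximize.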
A high AUC means the observed covariates are predictive of treatment. That is not automatically good or bad; it tells us treatment was not random. Very extreme propensities would be a warning for matching and weighting.
Propensity Overlap
This plot shows whether treated and untreated users occupy similar regions of the estimated propensity-score distribution. Strong overlap makes propensity methods more credible; weak overlap means some units have poor comparisons.
The distributions overlap enough for this teaching example, but the tails are worth noticing. The next notebook will go deeper on overlap, common support, and extreme weights.
Manual Matching Estimator
Matching creates treated-control comparisons among units with similar propensity scores. Here we use one-nearest-neighbor matching on the estimated propensity score.
To approximate an ATE rather than only an ATT, we match treated units to controls and controls to treated units, then average the implied individual contrasts.
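The bidirectional matching logic described above can be sketched as follows; the helper name `one_nn_matching_ate` is ours, and ties are broken by whichever neighbor the index returns first:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_matching_ate(propensity, treatment, outcome):
    """ATE from 1-nearest-neighbor propensity matching in both directions."""
    propensity = np.asarray(propensity, dtype=float).reshape(-1, 1)
    treatment = np.asarray(treatment)
    outcome = np.asarray(outcome, dtype=float)

    treated, control = treatment == 1, treatment == 0
    nn_control = NearestNeighbors(n_neighbors=1).fit(propensity[control])
    nn_treated = NearestNeighbors(n_neighbors=1).fit(propensity[treated])

    # For each treated unit find the closest control, and vice versa.
    _, idx_c = nn_control.kneighbors(propensity[treated])
    _, idx_t = nn_treated.kneighbors(propensity[control])

    effect_treated = outcome[treated] - outcome[control][idx_c.ravel()]
    effect_control = outcome[treated][idx_t.ravel()] - outcome[control]
    return np.concatenate([effect_treated, effect_control]).mean()

# Toy check: identical propensities and a constant effect of 1.0.
p = np.array([0.5] * 6)
t = np.array([1, 1, 1, 0, 0, 0])
y = 2.0 + t
ate = one_nn_matching_ate(p, t, y)  # -> 1.0
```

Matching both directions is what makes this an ATE approximation; matching treated units only would target the ATT instead.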
The matching estimate should be much closer to the true effect than the naive difference. The match-distance rows are simple diagnostics: smaller distances mean treated and untreated units were compared at more similar propensities.
Manual Stratification Estimator
Stratification divides the sample into propensity-score bins, compares treated and untreated users inside each bin, and averages those within-bin differences.
This is a useful bridge between matching and weighting because it makes the balancing logic very visible.
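The bin-compare-average logic can be sketched like this; the function name and the choice to weight strata by their size (which targets an ATE) are our assumptions:

```python
import numpy as np
import pandas as pd

def stratified_ate(propensity, treatment, outcome, n_strata=10):
    """ATE from within-stratum treated-control differences, weighted by stratum size."""
    df = pd.DataFrame({"p": propensity, "t": treatment, "y": outcome})
    df["stratum"] = pd.qcut(df["p"], q=n_strata, duplicates="drop")
    effects, weights = [], []
    for _, cell in df.groupby("stratum", observed=True):
        treated, control = cell[cell["t"] == 1], cell[cell["t"] == 0]
        if len(treated) == 0 or len(control) == 0:
            continue  # fragile stratum: no within-bin comparison possible
        effects.append(treated["y"].mean() - control["y"].mean())
        weights.append(len(cell))
    return np.average(effects, weights=weights)

# Toy check: noiseless outcome with a constant effect of 1.5.
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, 400)
t = rng.binomial(1, p)
y = 1.5 * t
ate = stratified_ate(p, t, y, n_strata=5)  # -> 1.5
```

Strata that contain only treated or only control units are skipped here; in practice those skipped bins are themselves an overlap warning worth reporting.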
Each stratum creates a local treated-versus-control comparison among units with similar estimated propensities. If a stratum has very few treated or control units, the estimate for that stratum becomes fragile.
Manual Weighting Estimator
Inverse propensity weighting gives each observed unit a weight based on how surprising its treatment status was. Treated units get weight 1 / e(X), and untreated units get weight 1 / (1 - e(X)).
This cell computes both a plain IPW ATE and a normalized, or self-normalized, IPW estimate.
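The two estimators described above can be sketched directly from their formulas. The plain version averages the weighted outcomes against the full sample size; the normalized (Hajek-style) version divides by the realized weight totals instead:

```python
import numpy as np

def ipw_estimates(propensity, treatment, outcome):
    """Plain and self-normalized IPW ATE estimates."""
    p = np.asarray(propensity, dtype=float)
    t = np.asarray(treatment, dtype=float)
    y = np.asarray(outcome, dtype=float)

    w_treated = t / p              # treated units weighted by 1 / e(X)
    w_control = (1 - t) / (1 - p)  # untreated units weighted by 1 / (1 - e(X))

    plain = np.mean(w_treated * y) - np.mean(w_control * y)
    normalized = (np.sum(w_treated * y) / np.sum(w_treated)
                  - np.sum(w_control * y) / np.sum(w_control))
    return plain, normalized

# Toy check with known propensities and a constant effect of 2.0.
rng = np.random.default_rng(1)
p = rng.uniform(0.2, 0.8, 5_000)
t = rng.binomial(1, p).astype(float)
y = 2.0 * t + p  # outcome shifted by the confounding score
plain_ate, normalized_ate = ipw_estimates(p, t, y)
```

Normalization forces the weights in each arm to sum to one, which is why it is less sensitive to a few oversized weights.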
The plain IPW and normalized IPW estimates can differ because normalization reduces sensitivity to weight scale. The maximum and p99 weights are early warnings about variance; the next notebook will inspect weights more deeply.
Covariate Balance After Weighting
A weighting estimator should reduce imbalance in observed confounders. This table compares raw standardized mean differences to IPW-weighted standardized mean differences.
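The weighted version of the balance metric replaces plain means and variances with weighted ones. A sketch under our own naming, using true inverse-propensity weights so the rebalancing effect is visible:

```python
import numpy as np

def weighted_smd(x, treatment, weights):
    """Standardized mean difference computed with per-unit weights."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(treatment)
    w = np.asarray(weights, dtype=float)
    means, variances = [], []
    for group in (1, 0):
        mask = t == group
        m = np.average(x[mask], weights=w[mask])
        v = np.average((x[mask] - m) ** 2, weights=w[mask])
        means.append(m)
        variances.append(v)
    pooled_sd = np.sqrt((variances[0] + variances[1]) / 2)
    return (means[0] - means[1]) / pooled_sd

# A confounded covariate: raw SMD is large, IPW-weighted SMD shrinks toward zero.
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 20_000)
p = 1 / (1 + np.exp(-x))           # treatment depends on x -> raw imbalance
t = rng.binomial(1, p)
w = np.where(t == 1, 1 / p, 1 / (1 - p))
raw = weighted_smd(x, t, np.ones_like(x))
weighted = weighted_smd(x, t, w)
```

Passing unit weights recovers the raw SMD, so one function serves both columns of the balance table.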
Weighting should pull the standardized mean differences toward zero relative to the raw column. The adjusted estimators should likewise land much closer to the known ATE than the naive comparison, though not identically, because each method uses a different finite-sample approximation.
Plot Manual Estimator Comparison
This plot makes the estimator comparison easier to scan. The dashed vertical line marks the known ATE.
Estimator disagreement is not automatically a failure. It is a diagnostic prompt: inspect overlap, model specification, matching quality, and whether all estimators are really targeting the same population estimand.
Create The DoWhy Causal Graph
Now we move from manual estimators to DoWhy estimators. The graph states that all baseline variables are common causes of treatment and outcome.
This is the same backdoor graph for every DoWhy estimator below. The graph and estimand stay fixed; only the statistical estimator changes.
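The DoWhy construction cell itself is not reproduced here, but the graph it encodes can be sketched with networkx (which the setup cell already imports). The node names match the notebook; the helper logic for extracting common causes is our illustration:

```python
import networkx as nx

confounders = [
    "user_engagement", "prior_sessions", "account_age_weeks",
    "is_power_user", "baseline_value",
]

# Backdoor graph: every baseline covariate causes both treatment and outcome.
graph = nx.DiGraph()
graph.add_edge("feature_exposure", "weekly_value")
for covariate in confounders:
    graph.add_edge(covariate, "feature_exposure")
    graph.add_edge(covariate, "weekly_value")

# Common causes are exactly the shared parents of treatment and outcome.
common_causes = (set(graph.predecessors("feature_exposure"))
                 & set(graph.predecessors("weekly_value")))
is_dag = nx.is_directed_acyclic_graph(graph)
```

The same edge list can be serialized (for example to GML or DOT) and handed to DoWhy's `CausalModel` as its graph argument.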
Visualize The Estimator Graph
This graph is intentionally simpler than the previous notebook’s graph. Every baseline covariate is a common cause, and the treatment points to the outcome.
The detected common causes match the graph. That is the key checkpoint before comparing estimators.
Print The Identified Estimand
The estimand output is verbose, but it is the causal heart of the notebook. It states what must be true before any of the estimators can be read causally.
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d/d[feature_exposure] E[weekly_value | user_engagement, prior_sessions, account_age_weeks, baseline_value, is_power_user]
Estimand assumption 1, Unconfoundedness: If U→{feature_exposure} and U→weekly_value then P(weekly_value|feature_exposure,user_engagement,prior_sessions,account_age_weeks,baseline_value,is_power_user,U) = P(weekly_value|feature_exposure,user_engagement,prior_sessions,account_age_weeks,baseline_value,is_power_user)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
d/d[feature_exposure] E[weekly_value | user_engagement, prior_sessions, account_age_weeks, baseline_value, is_power_user]
Estimand assumption 1, Unconfoundedness: If U→{feature_exposure} and U→weekly_value then P(weekly_value|feature_exposure,user_engagement,prior_sessions,account_age_weeks,baseline_value,is_power_user,U) = P(weekly_value|feature_exposure,user_engagement,prior_sessions,account_age_weeks,baseline_value,is_power_user)
The estimand is a backdoor-adjusted average treatment effect. The next cell changes only the estimator, not the causal target.
Run DoWhy Estimators
DoWhy provides estimator names for regression, matching, stratification, and weighting. This cell runs all four against the same identified estimand.
The DoWhy estimates should be near the true effect and much closer than the naive association. Differences across estimators reflect finite-sample behavior and estimator-specific modeling choices.
Manual And DoWhy Estimates Side By Side
This table combines the manual and DoWhy estimates. The exact manual and DoWhy values may differ because the internal implementations are not identical, but they should tell a similar causal story.
The combined table is the main result of the notebook. A good causal workflow asks whether the estimator family changes the conclusion enough to affect the decision.
Plot All Estimator Results
This plot compares all estimates against the known true ATE. In real data the true ATE is unavailable, but this plot style is still useful with a reference estimate or sensitivity band.
plot_all = all_estimator_comparison.query("estimator != 'known_true_ate'").copy()
fig, ax = plt.subplots(figsize=(12, 7))
sns.scatterplot(
    data=plot_all,
    x="estimate",
    y="estimator",
    hue="source",
    s=90,
    ax=ax,
)
ax.axvline(TRUE_ATE, color="#111827", linestyle="--", linewidth=1.4, label="Known true ATE")
ax.set_title("Manual And DoWhy Estimators For The Same Backdoor Estimand")
ax.set_xlabel("Estimated effect on weekly_value")
ax.set_ylabel("")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_all_estimator_comparison.png", dpi=160, bbox_inches="tight")
plt.show()
The adjusted estimators cluster near the true effect, while the naive estimate is far away. This is the central practical lesson: the causal graph and adjustment logic matter more than the surface sophistication of the estimator.
Estimator Choice Guide
This table turns the comparison into practical guidance. It is not a rigid rulebook, but it helps students choose a reasonable first estimator and know what to check next.
estimator_choice_guide = pd.DataFrame(
    [
        {
            "situation": "Need a transparent first estimate",
            "reasonable_start": "Outcome regression",
            "what_to_check": "Model specification, residual patterns, and whether treatment effect is plausibly constant.",
        },
        {
            "situation": "Want intuitive treated-control comparisons",
            "reasonable_start": "Propensity-score matching",
            "what_to_check": "Match distances, unmatched regions, and whether matching changes the target population.",
        },
        {
            "situation": "Want a simple propensity-based diagnostic table",
            "reasonable_start": "Propensity-score stratification",
            "what_to_check": "Treated/control counts and effect stability inside strata.",
        },
        {
            "situation": "Want a population-level weighted contrast",
            "reasonable_start": "Propensity-score weighting",
            "what_to_check": "Overlap, extreme weights, and weighted covariate balance.",
        },
        {
            "situation": "Estimators disagree materially",
            "reasonable_start": "Do not average them blindly",
            "what_to_check": "Graph assumptions, overlap, model misspecification, outliers, and target population differences.",
        },
    ]
)
estimator_choice_guide.to_csv(TABLE_DIR / "04_estimator_choice_guide.csv", index=False)
estimator_choice_guide
|   | situation | reasonable_start | what_to_check |
|---|-----------|------------------|---------------|
| 0 | Need a transparent first estimate | Outcome regression | Model specification, residual patterns, and whether treatment effect is plausibly constant. |
| 1 | Want intuitive treated-control comparisons | Propensity-score matching | Match distances, unmatched regions, and whether matching changes the target population. |
| 2 | Want a simple propensity-based diagnostic table | Propensity-score stratification | Treated/control counts and effect stability inside strata. |
| 3 | Want a population-level weighted contrast | Propensity-score weighting | Overlap, extreme weights, and weighted covariate balance. |
| 4 | Estimators disagree materially | Do not average them blindly | Graph assumptions, overlap, model misspecification, outliers, and target population differences. |
The right response to estimator disagreement is investigation, not automatic ensemble averaging. Estimators are diagnostic lenses on the same causal design.
Final Summary
This final table gives a compact report-ready summary of the notebook’s causal result.
final_summary = pd.DataFrame(
    [
        {"item": "Causal question", "summary": "Average total effect of feature exposure on weekly value."},
        {"item": "Identified estimand", "summary": "Backdoor-adjusted ATE using observed pre-treatment common causes."},
        {"item": "Known true ATE", "summary": f"{TRUE_ATE:.3f}"},
        {"item": "Naive estimate", "summary": f"{naive_effect:.3f}; inflated because treated users are stronger at baseline."},
        {"item": "Adjusted regression estimate", "summary": f"{regression_effect:.3f}"},
        {"item": "Manual matching estimate", "summary": f"{manual_matching_ate:.3f}"},
        {"item": "Manual stratification estimate", "summary": f"{manual_stratification_ate:.3f}"},
        {"item": "Manual normalized IPW estimate", "summary": f"{manual_normalized_ipw_ate:.3f}"},
        {"item": "Main diagnostic", "summary": "Adjusted estimators cluster near the true effect and improve sharply over the naive comparison."},
        {"item": "Main limitation", "summary": "All estimates still depend on the observed-confounding assumption and adequate overlap."},
    ]
)
final_summary.to_csv(TABLE_DIR / "04_final_estimator_summary.csv", index=False)
final_summary
|   | item | summary |
|---|------|---------|
| 0 | Causal question | Average total effect of feature exposure on weekly value. |
| 1 | Identified estimand | Backdoor-adjusted ATE using observed pre-treatment common causes. |
| 2 | Known true ATE | 1.650 |
| 3 | Naive estimate | 3.443; inflated because treated users are stronger at baseline. |
| 4 | Adjusted regression estimate | 1.676 |
| 5 | Manual matching estimate | 1.732 |
| 6 | Manual stratification estimate | 1.766 |
| 7 | Manual normalized IPW estimate | 1.753 |
| 8 | Main diagnostic | Adjusted estimators cluster near the true effect and improve sharply over the naive comparison. |
| 9 | Main limitation | All estimates still depend on the observed-confounding assumption and adequate overlap. |
The final summary separates the causal claim from the estimator mechanics. That is the habit to preserve: report the graph assumptions, the estimand, the estimators, and the diagnostics together.
Student Exercises
Try these after running the notebook:
Increase the strength of confounding in the treatment assignment equation and watch the naive estimate move farther away.
Reduce sample size and see which estimators become noisier.
Change the number of propensity strata from 10 to 5 or 20.
Match on all standardized covariates instead of only the propensity score.
Remove clipping in the IPW estimator and inspect the maximum weight.
Add a nonlinear outcome term and compare how regression and propensity estimators respond.
Closing Notes
This notebook showed that multiple estimators can target the same identified causal effect. Regression, matching, stratification, and weighting are different ways to operationalize the same backdoor logic.
The next tutorial will focus more deeply on weighting, overlap, common support, and why extreme propensity scores can make causal estimates fragile.