DoWhy Tutorial 05: Weighting, Overlap, And Common Support
This notebook focuses on the diagnostic side of propensity weighting. Inverse propensity weighting can be powerful, but it becomes fragile when treated and untreated units do not overlap. If some users are almost certain to receive treatment, or almost certain not to receive treatment, the data contain weak comparisons for those users.
We will compare two synthetic observational datasets: one with usable overlap and one with weak overlap. The causal graph and true treatment effect are the same in both cases. What changes is how separable the treated and untreated groups are.
Learning Goals
By the end of this notebook, you should be able to:
Explain overlap, common support, and positivity in practical language.
Estimate propensity scores and inspect treated-control overlap.
Use effective sample size to see when weights are fragile.
Compare raw and weighted covariate balance.
Understand why a weighting estimator can become unstable even when the causal graph is correct.
Run DoWhy’s propensity-score weighting estimator and read it alongside manual diagnostics.
Why Overlap Matters
Backdoor adjustment compares treated and untreated units with similar observed covariates. Propensity weighting does this by giving each unit a weight based on how surprising its observed treatment status was.
If a treated unit had a very low probability of being treated, its treated observation is rare and receives a large weight. If an untreated unit had a very high probability of being treated, its untreated observation is rare and receives a large weight. A few very large weights can dominate the estimate.
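In symbols, writing $T_i$ for the treatment indicator of unit $i$ and $\hat{e}(x_i)$ for its estimated propensity score, the plain inverse propensity weight is

$$
w_i = \frac{T_i}{\hat{e}(x_i)} + \frac{1 - T_i}{1 - \hat{e}(x_i)},
$$

so a treated unit with $\hat{e}(x_i) = 0.01$ receives weight 100, while one with $\hat{e}(x_i) = 0.5$ receives weight 2. Weights blow up exactly where one treatment arm becomes rare.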
That is the practical overlap problem: the math may still run, but the estimate is supported by too few comparable observations.
Setup
This setup cell imports the packages used in the notebook, creates output folders, fixes a random seed, and suppresses known third-party compatibility warnings. The warning policy keeps expected library chatter out of the student-facing notebook while preserving real execution errors.
```python
from pathlib import Path
import os
import platform
import sys
import warnings

START_DIR = Path.cwd().resolve()
PROJECT_ROOT = next(
    (candidate for candidate in [START_DIR, *START_DIR.parents] if (candidate / "pyproject.toml").exists()),
    START_DIR,
)
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "dowhy"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
CACHE_DIR = PROJECT_ROOT / ".cache" / "matplotlib"
for directory in [OUTPUT_DIR, FIGURE_DIR, TABLE_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("MPLCONFIGDIR", str(CACHE_DIR))

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*disp.*iprint.*L-BFGS-B.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="seaborn.categorical")
warnings.filterwarnings("ignore", module="pydot.dot_parser")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import dowhy
from dowhy import CausalModel

RANDOM_SEED = 55
rng = np.random.default_rng(RANDOM_SEED)
sns.set_theme(style="whitegrid", context="notebook")

print(f"Python executable: {sys.executable}")
print(f"Python version: {platform.python_version()}")
print(f"DoWhy version: {getattr(dowhy, '__version__', 'unknown')}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
```
The notebook is ready if this cell prints a DoWhy version. All generated artifacts from this notebook use a 05_ prefix.
Key Concepts
This table defines the vocabulary used throughout the notebook. These terms often appear together, but they answer slightly different diagnostic questions.
```python
concept_table = pd.DataFrame(
    [
        {
            "concept": "Propensity score",
            "plain_language": "The probability of receiving treatment given observed covariates.",
            "why_it_matters": "It summarizes observed treatment selection into one balancing score.",
        },
        {
            "concept": "Overlap",
            "plain_language": "Treated and untreated units exist at similar covariate or propensity values.",
            "why_it_matters": "Without overlap, comparisons require extrapolation.",
        },
        {
            "concept": "Common support",
            "plain_language": "The region of propensity scores where both treatment groups are represented.",
            "why_it_matters": "Estimates outside common support are weakly supported by data.",
        },
        {
            "concept": "Positivity",
            "plain_language": "Every covariate profile has a nonzero chance of receiving each treatment level.",
            "why_it_matters": "If treatment is deterministic for some profiles, causal contrasts cannot be learned from observed data there.",
        },
        {
            "concept": "Extreme weights",
            "plain_language": "Very large inverse propensity weights from near-zero or near-one propensities.",
            "why_it_matters": "A few units can dominate the estimate and inflate variance.",
        },
        {
            "concept": "Effective sample size",
            "plain_language": "The sample size implied by the concentration of weights.",
            "why_it_matters": "A nominally large dataset can behave like a much smaller one after weighting.",
        },
    ]
)
concept_table.to_csv(TABLE_DIR / "05_weighting_concepts.csv", index=False)
concept_table
```
| | concept | plain_language | why_it_matters |
|---|---|---|---|
| 0 | Propensity score | The probability of receiving treatment given observed covariates. | It summarizes observed treatment selection into one balancing score. |
| 1 | Overlap | Treated and untreated units exist at similar covariate or propensity values. | Without overlap, comparisons require extrapolation. |
| 2 | Common support | The region of propensity scores where both treatment groups are represented. | Estimates outside common support are weakly supported by data. |
| 3 | Positivity | Every covariate profile has a nonzero chance of receiving each treatment level. | If treatment is deterministic for some profiles, causal contrasts cannot be learned from observed data there. |
| 4 | Extreme weights | Very large inverse propensity weights from near-zero or near-one propensities. | A few units can dominate the estimate and inflate variance. |
| 5 | Effective sample size | The sample size implied by the concentration of weights. | A nominally large dataset can behave like a much smaller one after weighting. |
The headline idea is simple: weighting is not only about computing a formula. It is also about checking whether the weighted comparison is supported by enough comparable observations.
Causal Question And Variable Roles
The causal question is the same in both overlap scenarios:
What is the average effect of feature_exposure on weekly_value?
The graph assumes all adjustment variables are observed pre-treatment common causes.
```python
role_table = pd.DataFrame(
    [
        {"variable": "feature_exposure", "role": "treatment", "timing": "treatment time", "adjustment_guidance": "treatment, not a control"},
        {"variable": "weekly_value", "role": "outcome", "timing": "future outcome window", "adjustment_guidance": "outcome, not a control"},
        {"variable": "user_engagement", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "prior_sessions", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "account_age_weeks", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "is_power_user", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "baseline_value", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "true_propensity", "role": "simulation diagnostic", "timing": "known only because this is simulated", "adjustment_guidance": "do not use as a real observed column"},
    ]
)
role_table.to_csv(TABLE_DIR / "05_variable_roles.csv", index=False)
role_table
```
| | variable | role | timing | adjustment_guidance |
|---|---|---|---|---|
| 0 | feature_exposure | treatment | treatment time | treatment, not a control |
| 1 | weekly_value | outcome | future outcome window | outcome, not a control |
| 2 | user_engagement | observed common cause | pre-treatment | adjust |
| 3 | prior_sessions | observed common cause | pre-treatment | adjust |
| 4 | account_age_weeks | observed common cause | pre-treatment | adjust |
| 5 | is_power_user | observed common cause | pre-treatment | adjust |
| 6 | baseline_value | observed common cause | pre-treatment | adjust |
| 7 | true_propensity | simulation diagnostic | known only because this is simulated | do not use as a real observed column |
The same roles apply in both simulated scenarios. This is important: the graph can be correct and the adjustment set can be right, while weighting is still unstable because overlap is weak.
Create Two Overlap Scenarios
This function creates two datasets with the same outcome equation and the same true causal effect. The only difference is treatment-selection strength.
In the usable-overlap case, baseline variables influence treatment, but not so strongly that treatment is almost deterministic.
In the weak-overlap case, baseline variables strongly separate treated and untreated users.
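The generating function itself is not reproduced in this extract. The sketch below shows the general shape of such a simulator; the coefficient values, the `selection_strength` argument, and the 5,000-rows-per-scenario split are illustrative assumptions, not the notebook's exact choices. Only the column names and the true ATE of 1.6 are taken from the notebook.

```python
def simulate_scenario(name: str, selection_strength: float, n: int = 5_000,
                      true_ate: float = 1.6, seed: int = 55) -> pd.DataFrame:
    """Sketch of a two-scenario simulator; coefficients are illustrative."""
    rng = np.random.default_rng(seed)
    user_engagement = rng.normal(0.0, 1.5, n)
    prior_sessions = rng.poisson(4.0, n)
    account_age_weeks = rng.gamma(4.0, 3.0, n)
    is_power_user = rng.binomial(1, 0.2, n)
    baseline_value = 2.0 + 0.8 * user_engagement + 0.1 * prior_sessions + rng.normal(0.0, 1.0, n)

    # Larger selection_strength pushes propensities toward 0 and 1,
    # which is the only difference between the two scenarios.
    logit = selection_strength * (
        0.6 * user_engagement + 0.15 * (prior_sessions - 4) + 0.5 * is_power_user
    )
    true_propensity = 1.0 / (1.0 + np.exp(-logit))
    feature_exposure = rng.binomial(1, true_propensity)

    # Same outcome equation, same true effect, in both scenarios.
    weekly_value = (
        baseline_value
        + true_ate * feature_exposure
        + 0.05 * account_age_weeks
        + rng.normal(0.0, 1.5, n)
    )
    return pd.DataFrame({
        "scenario": name,
        "feature_exposure": feature_exposure,
        "weekly_value": weekly_value,
        "user_engagement": user_engagement,
        "prior_sessions": prior_sessions,
        "account_age_weeks": account_age_weeks,
        "is_power_user": is_power_user,
        "baseline_value": baseline_value,
        "true_propensity": true_propensity,
    })

TRUE_ATE = 1.6
data = pd.concat(
    [simulate_scenario("usable_overlap", 0.8), simulate_scenario("weak_overlap", 3.0)],
    ignore_index=True,
)
```

The preview below comes from such a combined frame.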
Rows: 10,000
Known true ATE in both scenarios: 1.6000

| | scenario | feature_exposure | weekly_value | user_engagement | prior_sessions | account_age_weeks | is_power_user | baseline_value | true_propensity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | usable_overlap | 1 | 9.855097 | 0.842261 | 1 | 8.283946 | 0 | 3.686182 | 0.557632 |
| 1 | usable_overlap | 1 | 2.480982 | -2.976111 | 5 | 6.161959 | 0 | -0.600032 | 0.118939 |
| 2 | usable_overlap | 1 | 6.655229 | -0.305024 | 6 | 13.987074 | 0 | 2.533541 | 0.407891 |
| 3 | usable_overlap | 1 | 12.785204 | 1.449888 | 5 | 18.359764 | 0 | 4.458250 | 0.655687 |
| 4 | usable_overlap | 0 | 4.590193 | -1.243961 | 2 | 10.443849 | 0 | 2.352953 | 0.281285 |
Both scenarios have the same true treatment effect. If estimates behave differently, the difference is coming from treatment assignment and overlap, not from a different causal effect.
Scenario Summary
This table compares treatment rates and true propensity ranges across the two scenarios.
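The summarizing code is not shown in this extract; a minimal sketch, assuming the combined `data` frame from the simulator sketch above:

```python
scenario_summary = data.groupby("scenario").agg(
    n=("feature_exposure", "size"),
    treatment_rate=("feature_exposure", "mean"),
    min_true_propensity=("true_propensity", "min"),
    max_true_propensity=("true_propensity", "max"),
)
print(scenario_summary.round(3))
```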
The weak-overlap scenario has propensities much closer to zero and one. That means some treated or untreated observations will receive much larger inverse-propensity weights.
Estimate Propensity Scores In Each Scenario
In real observational data we do not know true propensities, so we estimate them. This cell fits a separate logistic propensity model in each scenario using the same observed common causes.
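A minimal sketch of this step, assuming the `data` frame from above; the `confounders` list matches the role table, and storing the scores in a `propensity` column is an illustrative naming choice:

```python
confounders = [
    "user_engagement", "prior_sessions", "account_age_weeks",
    "is_power_user", "baseline_value",
]

for scenario, frame in data.groupby("scenario"):
    # Standardizing inputs keeps the logistic fit well conditioned.
    propensity_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    propensity_model.fit(frame[confounders], frame["feature_exposure"])
    scores = propensity_model.predict_proba(frame[confounders])[:, 1]
    data.loc[frame.index, "propensity"] = scores
    auc = roc_auc_score(frame["feature_exposure"], scores)
    print(f"{scenario}: propensity-model AUC = {auc:.3f}")
```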
The weak-overlap scenario should have a higher propensity-model AUC because treatment assignment is easier to predict. For causal weighting, easier treatment prediction often means weaker overlap.
Plot Propensity Overlap
This plot compares treated and untreated propensity distributions in each scenario. Good overlap means both groups occupy similar regions of the propensity scale.
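A sketch of such a plot, assuming the estimated `propensity` column from the previous sketch; the figure filename follows the notebook's 05_ prefix convention but is itself an assumption:

```python
fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharex=True, sharey=True)
for ax, (scenario, frame) in zip(axes, data.groupby("scenario")):
    sns.histplot(
        data=frame, x="propensity", hue="feature_exposure",
        bins=40, stat="density", common_norm=False, ax=ax,
    )
    ax.set_title(scenario)
    ax.set_xlabel("estimated propensity score")
fig.tight_layout()
fig.savefig(FIGURE_DIR / "05_propensity_overlap.png", dpi=150)
```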
The weak-overlap panel should show more separation. That separation is the visual warning that weighting will rely on a smaller, more fragile set of comparable observations.
Common Support Diagnostics
Common support asks whether treated and untreated users exist over the same propensity range. This cell summarizes overlap using min/max ranges and the share of observations inside simple trimming bands.
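A sketch of these diagnostics under the same assumed column names:

```python
bands = [(0.01, 0.99), (0.05, 0.95), (0.10, 0.90)]
for scenario, frame in data.groupby("scenario"):
    p = frame["propensity"]
    treated = p[frame["feature_exposure"] == 1]
    control = p[frame["feature_exposure"] == 0]
    print(
        f"{scenario}: treated support [{treated.min():.3f}, {treated.max():.3f}], "
        f"control support [{control.min():.3f}, {control.max():.3f}]"
    )
    for lo, hi in bands:
        # Share of rows that a trimming band would keep.
        print(f"  share inside [{lo:.2f}, {hi:.2f}]: {p.between(lo, hi).mean():.1%}")
```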
The trimming-band shares show how much data would remain if we restricted analysis to less extreme propensity regions. Trimming improves stability, but it changes the target population.
Compute Weights And Effective Sample Size
This cell computes several weight variants:
Plain inverse propensity weights.
Stabilized weights, which multiply by marginal treatment probabilities.
Clipped weights, using propensities clipped to [0.01, 0.99] and [0.05, 0.95].
It also computes effective sample size, which falls when weights concentrate on a few units.
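A sketch of the weight construction and the Kish effective-sample-size formula. The column name `ipw_weight_clip01` matches the name queried by the final-summary cell; the rest of the names are illustrative:

```python
def effective_sample_size(weights) -> float:
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

t = data["feature_exposure"]
p = data["propensity"]
p01 = p.clip(0.01, 0.99)  # the [0.05, 0.95] variant is built the same way

data["ipw_weight"] = t / p + (1 - t) / (1 - p)             # plain inverse propensity weights
data["ipw_weight_clip01"] = t / p01 + (1 - t) / (1 - p01)  # clipped to [0.01, 0.99]

for scenario, frame in data.groupby("scenario"):
    marginal = frame["feature_exposure"].mean()
    # Stabilized weights multiply by the marginal treatment probability.
    stabilized = np.where(
        frame["feature_exposure"] == 1,
        marginal / frame["propensity"],
        (1 - marginal) / (1 - frame["propensity"]),
    )
    print(
        f"{scenario}: max plain weight {frame['ipw_weight'].max():.1f}, "
        f"max stabilized weight {stabilized.max():.1f}, "
        f"ESS ≈ {effective_sample_size(frame['ipw_weight_clip01']):.0f} of {len(frame)}"
    )
```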
The weak-overlap scenario should have larger maximum weights and a smaller effective sample size. A nominal sample of thousands can behave like a much smaller sample if a few units receive huge weights.
Plot Weight Distributions
Weights are easier to diagnose on a log scale because the right tail is what usually causes trouble.
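A sketch of the log-scale view, using the plain weights from the previous sketch:

```python
fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)
for ax, (scenario, frame) in zip(axes, data.groupby("scenario")):
    sns.histplot(np.log10(frame["ipw_weight"]), bins=40, ax=ax)
    ax.set_title(scenario)
    ax.set_xlabel("log10 of plain IPW weight")
fig.tight_layout()
```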
The weak-overlap scenario should have a heavier right tail. Those high-weight observations are the units that make the weighting estimator more fragile.
Balance Before And After Weighting
A good weighting model should reduce imbalance in observed covariates. This cell compares raw balance to IPW-weighted balance using standardized mean differences.
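A sketch of the balance check. The pooled-standard-deviation denominator below is one common SMD convention; the notebook's exact implementation may differ.

```python
def smd(x, treated, weights=None) -> float:
    """Standardized mean difference between treated and control, optionally weighted."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated).astype(bool)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    mean_t = np.average(x[treated], weights=w[treated])
    mean_c = np.average(x[~treated], weights=w[~treated])
    pooled_sd = np.sqrt((x[treated].var(ddof=1) + x[~treated].var(ddof=1)) / 2)
    return (mean_t - mean_c) / pooled_sd

for scenario, frame in data.groupby("scenario"):
    t = frame["feature_exposure"]
    for cov in confounders:
        raw = smd(frame[cov], t)
        weighted = smd(frame[cov], t, frame["ipw_weight_clip01"])
        print(f"{scenario:15s} {cov:18s} raw SMD {raw:+.3f} -> weighted SMD {weighted:+.3f}")
```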
Weighted SMDs should move toward zero if the propensity model is balancing observed covariates. Balance can improve even when weights are unstable, so balance and weight diagnostics should be read together.
Plot Balance Diagnostics
This plot compares raw and weighted standardized mean differences across scenarios.
The weighted bars should shrink compared with the raw bars. If weighting balances covariates but the effective sample size collapses, the estimate may still be too fragile to trust without qualification.
Weighting Estimates Across Scenarios
Now we compute treatment-effect estimates for each scenario; a sketch of two of the estimators follows the list. The table includes:
Naive treated-minus-control difference.
Adjusted outcome regression.
IPW using clipped propensities.
Normalized IPW.
Stabilized-weight outcome mean difference.
Trimmed normalized IPW restricted to common propensity bands.
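A sketch of the naive difference and the normalized (Hájek-style) IPW estimator, with a trimmed variant; the other variants follow the same pattern:

```python
def normalized_ipw_ate(frame, weight_col="ipw_weight_clip01") -> float:
    """Hajek-style normalized IPW: weighted treated mean minus weighted control mean."""
    treated = frame["feature_exposure"] == 1
    w, y = frame[weight_col], frame["weekly_value"]
    return np.average(y[treated], weights=w[treated]) - np.average(y[~treated], weights=w[~treated])

for scenario, frame in data.groupby("scenario"):
    naive = (
        frame.loc[frame["feature_exposure"] == 1, "weekly_value"].mean()
        - frame.loc[frame["feature_exposure"] == 0, "weekly_value"].mean()
    )
    full = normalized_ipw_ate(frame)
    trimmed = normalized_ipw_ate(frame[frame["propensity"].between(0.05, 0.95)])
    print(f"{scenario}: naive {naive:.3f}, normalized IPW {full:.3f}, trimmed {trimmed:.3f}")
```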
The usable-overlap estimates should cluster closer together. In the weak-overlap case, raw weighting can drift because a few units have too much influence. Trimming often stabilizes the number, but it estimates the effect for a narrower population.
Plot Weighting Estimates
This plot compares estimators by scenario. The dashed vertical line marks the known true ATE.
The weak-overlap panel should look less stable. The point sizes show how much data each trimmed estimator kept; trimming can reduce variance but changes the population being described.
Trimming Changes The Target Population
Trimming is not just a technical cleanup. It removes units from the tails of the propensity distribution, which often removes users with more extreme baseline profiles.
This cell summarizes how baseline covariates change after trimming.
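A sketch of that comparison, reusing the assumed `propensity` column and `confounders` list:

```python
for scenario, frame in data.groupby("scenario"):
    inside = frame["propensity"].between(0.05, 0.95)
    comparison = pd.DataFrame({
        "full_sample_mean": frame[confounders].mean(),
        "trimmed_mean": frame.loc[inside, confounders].mean(),
    })
    print(f"\n{scenario}: trimming keeps {inside.mean():.1%} of rows")
    print(comparison.round(3))
```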
The trimmed sample can differ from the full sample in its average baseline characteristics. The trimmed estimate is therefore often more stable, but it describes a narrower, more comparable subpopulation rather than the full user base.
DoWhy Weighting Under Usable And Weak Overlap
Now we run DoWhy’s backdoor.propensity_score_weighting estimator under the same graph in both scenarios. This connects the manual diagnostics to DoWhy’s estimator interface.
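A sketch of the DoWhy call, assuming the `data` frame and `confounders` list from earlier; passing `common_causes` instead of an explicit graph string is a simplification of the notebook's setup:

```python
for scenario, frame in data.groupby("scenario"):
    model = CausalModel(
        # Keep the simulation-only oracle columns out of the modeling data.
        data=frame.drop(columns=["scenario", "true_propensity"]),
        treatment="feature_exposure",
        outcome="weekly_value",
        common_causes=confounders,
    )
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    estimate = model.estimate_effect(
        estimand, method_name="backdoor.propensity_score_weighting"
    )
    print(f"{scenario}: DoWhy weighting estimate {estimate.value:.3f}")
```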
DoWhy uses the same graph in both scenarios. If the weighting estimate is less stable in the weak-overlap scenario, the issue is not graph identification; it is support and weighting fragility.
Plot DoWhy Estimates Against Manual Diagnostics
This plot shows DoWhy’s regression and weighting estimates next to the known ATE.
The DoWhy weighting estimate can drift in weak overlap for the same reason manual IPW drifts: extreme propensities create extreme influence. A clean API call does not remove the need for overlap diagnostics.
Practical Decision Guide
This table summarizes what to do when weighting diagnostics look good, borderline, or poor.
```python
decision_guide = pd.DataFrame(
    [
        {
            "diagnostic_pattern": "Good overlap, small weights, improved balance",
            "reasonable_next_step": "Report IPW or normalized IPW alongside regression and balance diagnostics.",
            "caution": "Still depends on observed-confounding assumptions.",
        },
        {
            "diagnostic_pattern": "Moderate tails but acceptable effective sample size",
            "reasonable_next_step": "Compare clipped, stabilized, and normalized weights; report sensitivity to trimming.",
            "caution": "Make clear whether trimming changes the target population.",
        },
        {
            "diagnostic_pattern": "Extreme weights and small effective sample size",
            "reasonable_next_step": "Avoid relying on raw IPW alone; consider trimming, overlap weights, redesign, or narrower estimand.",
            "caution": "The full-population ATE may not be well supported by observed data.",
        },
        {
            "diagnostic_pattern": "Balance remains poor after weighting",
            "reasonable_next_step": "Revisit the propensity model and graph assumptions before interpreting the estimate.",
            "caution": "A weighted estimate without balance is not reassuring.",
        },
    ]
)
decision_guide.to_csv(TABLE_DIR / "05_weighting_decision_guide.csv", index=False)
decision_guide
```
| | diagnostic_pattern | reasonable_next_step | caution |
|---|---|---|---|
| 0 | Good overlap, small weights, improved balance | Report IPW or normalized IPW alongside regression and balance diagnostics. | Still depends on observed-confounding assumptions. |
| 1 | Moderate tails but acceptable effective sample size | Compare clipped, stabilized, and normalized weights; report sensitivity to trimming. | Make clear whether trimming changes the target population. |
| 2 | Extreme weights and small effective sample size | Avoid relying on raw IPW alone; consider trimming, overlap weights, redesign, or narrower estimand. | The full-population ATE may not be well supported by observed data. |
| 3 | Balance remains poor after weighting | Revisit the propensity model and graph assumptions before interpreting the estimate. | A weighted estimate without balance is not reassuring. |
This is the practical mindset: weighting is only credible when the weighted comparison is both balanced and supported by enough observations.
Final Summary
This final table gives a compact report-ready summary of the notebook’s lessons.
```python
usable_norm = weighted_estimates.query(
    "scenario == 'usable_overlap' and estimator == 'normalized_ipw_clip01'"
)["estimate"].iloc[0]
weak_norm = weighted_estimates.query(
    "scenario == 'weak_overlap' and estimator == 'normalized_ipw_clip01'"
)["estimate"].iloc[0]
usable_ess = weight_diagnostics.query(
    "scenario == 'usable_overlap' and weight_column == 'ipw_weight_clip01'"
)["effective_sample_size"].iloc[0]
weak_ess = weight_diagnostics.query(
    "scenario == 'weak_overlap' and weight_column == 'ipw_weight_clip01'"
)["effective_sample_size"].iloc[0]

final_summary = pd.DataFrame(
    [
        {"item": "Causal question", "summary": "Average effect of feature exposure on weekly value."},
        {"item": "Known true ATE", "summary": f"{TRUE_ATE:.3f}"},
        {"item": "Usable-overlap normalized IPW", "summary": f"{usable_norm:.3f}; effective sample size about {usable_ess:.0f}."},
        {"item": "Weak-overlap normalized IPW", "summary": f"{weak_norm:.3f}; effective sample size about {weak_ess:.0f}."},
        {"item": "Main diagnostic lesson", "summary": "Weak overlap creates large weights and makes weighting estimates more fragile."},
        {"item": "Trimming lesson", "summary": "Trimming can stabilize estimates but changes the population being described."},
        {"item": "DoWhy lesson", "summary": "DoWhy can estimate weighted effects, but overlap diagnostics remain the analyst's responsibility."},
        {"item": "Main limitation", "summary": "All weighting estimators still depend on measured common causes and adequate support."},
    ]
)
final_summary.to_csv(TABLE_DIR / "05_final_weighting_summary.csv", index=False)
final_summary
```
| | item | summary |
|---|---|---|
| 0 | Causal question | Average effect of feature exposure on weekly value. |
| 1 | Known true ATE | 1.600 |
| 2 | Usable-overlap normalized IPW | 1.613; effective sample size about 4328. |
| 3 | Weak-overlap normalized IPW | 2.171; effective sample size about 1020. |
| 4 | Main diagnostic lesson | Weak overlap creates large weights and makes weighting estimates more fragile. |
| 5 | Trimming lesson | Trimming can stabilize estimates but changes the population being described. |
| 6 | DoWhy lesson | DoWhy can estimate weighted effects, but overlap diagnostics remain the analyst's responsibility. |
| 7 | Main limitation | All weighting estimators still depend on measured common causes and adequate support. |
The key point is not that weighting is good or bad. Weighting is useful when the data support the comparison. When overlap is weak, the responsible answer may be a narrower estimand, a trimmed population, or a redesign of the analysis.
Student Exercises
Try these after running the notebook:
Increase treatment_selection_strength in the weak-overlap dataset and watch the effective sample size fall.
Change the clipping thresholds from [0.01, 0.99] to [0.02, 0.98] and compare estimates.
Compare trimming bands [0.05, 0.95], [0.10, 0.90], and [0.20, 0.80].
Add a nonlinear term to the propensity model and see whether balance improves.
Remove one confounder from the propensity model and inspect both balance and bias.
Write a short stakeholder summary explaining why a full-population ATE may not be supported under weak overlap.
Closing Notes
This notebook showed that inverse propensity weighting is not just a formula. It requires overlap, reasonable weights, adequate effective sample size, and improved covariate balance. The next notebook will move beyond backdoor weighting and introduce frontdoor, instrumental-variable, and natural-experiment logic.