DoWhy Tutorial 05: Weighting, Overlap, And Common Support

This notebook focuses on the diagnostic side of propensity weighting. Inverse propensity weighting can be powerful, but it becomes fragile when treated and untreated units do not overlap. If some users are almost certain to receive treatment, or almost certain not to, the data provide only weak comparisons for those users.

We will compare two synthetic observational datasets: one with usable overlap and one with weak overlap. The causal graph and true treatment effect are the same in both cases. What changes is how separable the treated and untreated groups are.

Learning Goals

By the end of this notebook, you should be able to:

  • Explain overlap, common support, and positivity in practical language.
  • Estimate propensity scores and inspect treated-control overlap.
  • Compute inverse propensity weights, stabilized weights, normalized IPW, and trimmed IPW estimates.
  • Use effective sample size to see when weights are fragile.
  • Compare raw and weighted covariate balance.
  • Understand why a weighting estimator can become unstable even when the causal graph is correct.
  • Run DoWhy’s propensity-score weighting estimator and read it alongside manual diagnostics.

Why Overlap Matters

Backdoor adjustment compares treated and untreated units with similar observed covariates. Propensity weighting does this by giving each unit a weight based on how surprising its observed treatment status was.

If a treated unit had a very low probability of being treated, its treated observation is rare and receives a large weight. If an untreated unit had a very high probability of being treated, its untreated observation is rare and receives a large weight. A few very large weights can dominate the estimate.

That is the practical overlap problem: the math may still run, but the estimate is supported by too few comparable observations.
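
To make the arithmetic concrete before any data work, here is a minimal sketch of the weight formula used throughout this notebook; the helper name ipw_weight is illustrative, not a DoWhy function.

def ipw_weight(treated: int, propensity: float) -> float:
    # Inverse propensity weight: 1/e for treated units, 1/(1 - e) for untreated units.
    return treated / propensity + (1 - treated) / (1 - propensity)

print(ipw_weight(1, 0.50))  # 2.0: a treated unit with a moderate propensity
print(ipw_weight(1, 0.02))  # 50.0: a treated unit that was very unlikely to be treated
print(ipw_weight(0, 0.98))  # about 50: an untreated unit that was almost certain to be treated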

Setup

This setup cell imports the packages used in the notebook, creates output folders, fixes a random seed, and suppresses known third-party compatibility warnings. The warning policy keeps expected library chatter out of the student-facing notebook while preserving real execution errors.

from pathlib import Path
import os
import platform
import sys
import warnings

START_DIR = Path.cwd().resolve()
PROJECT_ROOT = next(
    (candidate for candidate in [START_DIR, *START_DIR.parents] if (candidate / "pyproject.toml").exists()),
    START_DIR,
)

NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "dowhy"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
CACHE_DIR = PROJECT_ROOT / ".cache" / "matplotlib"

for directory in [OUTPUT_DIR, FIGURE_DIR, TABLE_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(CACHE_DIR))

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*disp.*iprint.*L-BFGS-B.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="seaborn.categorical")
warnings.filterwarnings("ignore", module="pydot.dot_parser")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import dowhy
from dowhy import CausalModel

RANDOM_SEED = 55
rng = np.random.default_rng(RANDOM_SEED)

sns.set_theme(style="whitegrid", context="notebook")

print(f"Python executable: {sys.executable}")
print(f"Python version: {platform.python_version()}")
print(f"DoWhy version: {getattr(dowhy, '__version__', 'unknown')}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
Python executable: /home/apex/Documents/ranking_sys/.venv/bin/python3
Python version: 3.13.12
DoWhy version: 0.14
Notebook directory: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy
Output directory: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy/outputs

The notebook is ready if this cell prints a DoWhy version. All generated artifacts from this notebook use a 05_ prefix.

Key Concepts

This table defines the vocabulary used throughout the notebook. These terms often appear together, but they answer slightly different diagnostic questions.

concept_table = pd.DataFrame(
    [
        {
            "concept": "Propensity score",
            "plain_language": "The probability of receiving treatment given observed covariates.",
            "why_it_matters": "It summarizes observed treatment selection into one balancing score.",
        },
        {
            "concept": "Overlap",
            "plain_language": "Treated and untreated units exist at similar covariate or propensity values.",
            "why_it_matters": "Without overlap, comparisons require extrapolation.",
        },
        {
            "concept": "Common support",
            "plain_language": "The region of propensity scores where both treatment groups are represented.",
            "why_it_matters": "Estimates outside common support are weakly supported by data.",
        },
        {
            "concept": "Positivity",
            "plain_language": "Every covariate profile has a nonzero chance of receiving each treatment level.",
            "why_it_matters": "If treatment is deterministic for some profiles, causal contrasts cannot be learned from observed data there.",
        },
        {
            "concept": "Extreme weights",
            "plain_language": "Very large inverse propensity weights from near-zero or near-one propensities.",
            "why_it_matters": "A few units can dominate the estimate and inflate variance.",
        },
        {
            "concept": "Effective sample size",
            "plain_language": "The sample size implied by the concentration of weights.",
            "why_it_matters": "A nominally large dataset can behave like a much smaller one after weighting.",
        },
    ]
)

concept_table.to_csv(TABLE_DIR / "05_weighting_concepts.csv", index=False)
concept_table
concept plain_language why_it_matters
0 Propensity score The probability of receiving treatment given o... It summarizes observed treatment selection int...
1 Overlap Treated and untreated units exist at similar c... Without overlap, comparisons require extrapola...
2 Common support The region of propensity scores where both tre... Estimates outside common support are weakly su...
3 Positivity Every covariate profile has a nonzero chance o... If treatment is deterministic for some profile...
4 Extreme weights Very large inverse propensity weights from nea... A few units can dominate the estimate and infl...
5 Effective sample size The sample size implied by the concentration o... A nominally large dataset can behave like a mu...

The headline idea is simple: weighting is not only about computing a formula. It is also about checking whether the weighted comparison is supported by enough comparable observations.

Causal Question And Variable Roles

The causal question is the same in both overlap scenarios:

What is the average effect of feature_exposure on weekly_value?

The graph assumes all adjustment variables are observed pre-treatment common causes.

role_table = pd.DataFrame(
    [
        {"variable": "feature_exposure", "role": "treatment", "timing": "treatment time", "adjustment_guidance": "treatment, not a control"},
        {"variable": "weekly_value", "role": "outcome", "timing": "future outcome window", "adjustment_guidance": "outcome, not a control"},
        {"variable": "user_engagement", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "prior_sessions", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "account_age_weeks", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "is_power_user", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "baseline_value", "role": "observed common cause", "timing": "pre-treatment", "adjustment_guidance": "adjust"},
        {"variable": "true_propensity", "role": "simulation diagnostic", "timing": "known only because this is simulated", "adjustment_guidance": "do not use as a real observed column"},
    ]
)

role_table.to_csv(TABLE_DIR / "05_variable_roles.csv", index=False)
role_table
variable role timing adjustment_guidance
0 feature_exposure treatment treatment time treatment, not a control
1 weekly_value outcome future outcome window outcome, not a control
2 user_engagement observed common cause pre-treatment adjust
3 prior_sessions observed common cause pre-treatment adjust
4 account_age_weeks observed common cause pre-treatment adjust
5 is_power_user observed common cause pre-treatment adjust
6 baseline_value observed common cause pre-treatment adjust
7 true_propensity simulation diagnostic known only because this is simulated do not use as a real observed column

The same roles apply in both simulated scenarios. This is important: the graph can be correct and the adjustment set can be right, while weighting is still unstable because overlap is weak.

Create Two Overlap Scenarios

This function creates two datasets with the same outcome equation and the same true causal effect. The only structural difference is the treatment-selection strength; the seeds differ only so that the two samples are independent draws.

  • In the usable-overlap case, baseline variables influence treatment, but not so strongly that treatment is almost deterministic.
  • In the weak-overlap case, baseline variables strongly separate treated and untreated users.

def make_overlap_data(n=5_000, treatment_selection_strength=0.55, seed=55, scenario="usable_overlap"):
    local_rng = np.random.default_rng(seed)

    user_engagement = local_rng.normal(loc=0.0, scale=1.0, size=n)
    prior_sessions = np.clip(
        local_rng.poisson(lam=np.exp(1.0 + 0.20 * user_engagement), size=n),
        0,
        30,
    )
    account_age_weeks = local_rng.gamma(shape=2.2, scale=3.5, size=n)
    is_power_user = local_rng.binomial(
        n=1,
        p=1 / (1 + np.exp(-(0.75 * user_engagement - 0.20))),
        size=n,
    )
    baseline_value = (
        2.0
        + 1.00 * user_engagement
        + 0.050 * prior_sessions
        + 0.020 * account_age_weeks
        + 0.550 * is_power_user
        + local_rng.normal(loc=0.0, scale=1.0, size=n)
    )

    selection_score = (
        0.90 * user_engagement
        + 0.050 * prior_sessions
        + 0.200 * baseline_value
        + 0.550 * is_power_user
        - 0.015 * account_age_weeks
    )
    treatment_logit = -0.55 + treatment_selection_strength * selection_score
    true_propensity = 1 / (1 + np.exp(-treatment_logit))
    feature_exposure = local_rng.binomial(n=1, p=true_propensity, size=n)

    true_ate = 1.60
    weekly_value = (
        4.0
        + true_ate * feature_exposure
        + 1.20 * user_engagement
        + 0.060 * prior_sessions
        + 0.030 * account_age_weeks
        + 0.600 * baseline_value
        + 0.450 * is_power_user
        + local_rng.normal(loc=0.0, scale=1.20, size=n)
    )

    return pd.DataFrame(
        {
            "scenario": scenario,
            "feature_exposure": feature_exposure,
            "weekly_value": weekly_value,
            "user_engagement": user_engagement,
            "prior_sessions": prior_sessions,
            "account_age_weeks": account_age_weeks,
            "is_power_user": is_power_user,
            "baseline_value": baseline_value,
            "true_propensity": true_propensity,
        }
    ), true_ate

usable_df, TRUE_ATE = make_overlap_data(treatment_selection_strength=0.55, seed=55, scenario="usable_overlap")
weak_df, _ = make_overlap_data(treatment_selection_strength=2.20, seed=56, scenario="weak_overlap")

overlap_df = pd.concat([usable_df, weak_df], ignore_index=True)
overlap_df.to_csv(TABLE_DIR / "05_overlap_teaching_dataset.csv", index=False)

print(f"Rows: {len(overlap_df):,}")
print(f"Known true ATE in both scenarios: {TRUE_ATE:.4f}")
overlap_df.head()
Rows: 10,000
Known true ATE in both scenarios: 1.6000
scenario feature_exposure weekly_value user_engagement prior_sessions account_age_weeks is_power_user baseline_value true_propensity
0 usable_overlap 1 9.855097 0.842261 1 8.283946 0 3.686182 0.557632
1 usable_overlap 1 2.480982 -2.976111 5 6.161959 0 -0.600032 0.118939
2 usable_overlap 1 6.655229 -0.305024 6 13.987074 0 2.533541 0.407891
3 usable_overlap 1 12.785204 1.449888 5 18.359764 0 4.458250 0.655687
4 usable_overlap 0 4.590193 -1.243961 2 10.443849 0 2.352953 0.281285

Both scenarios have the same true treatment effect. If estimates behave differently, the difference is coming from treatment assignment and overlap, not from a different causal effect.

Scenario Summary

This table compares treatment rates and true propensity ranges across the two scenarios.

scenario_summary = (
    overlap_df.groupby("scenario")
    .agg(
        rows=("weekly_value", "size"),
        treatment_rate=("feature_exposure", "mean"),
        outcome_mean=("weekly_value", "mean"),
        true_propensity_min=("true_propensity", "min"),
        true_propensity_p01=("true_propensity", lambda s: s.quantile(0.01)),
        true_propensity_median=("true_propensity", "median"),
        true_propensity_p99=("true_propensity", lambda s: s.quantile(0.99)),
        true_propensity_max=("true_propensity", "max"),
    )
    .reset_index()
)

scenario_summary.to_csv(TABLE_DIR / "05_scenario_summary.csv", index=False)
scenario_summary
scenario rows treatment_rate outcome_mean true_propensity_min true_propensity_p01 true_propensity_median true_propensity_p99 true_propensity_max
0 usable_overlap 5000 0.4678 6.831726 0.078186 0.152451 0.464421 0.819403 0.920257
1 weak_overlap 5000 0.6260 7.164682 0.000377 0.005390 0.758453 0.999701 0.999956

The weak-overlap scenario has propensities much closer to zero and one. That means some treated or untreated observations will receive much larger inverse-propensity weights.

Estimate Propensity Scores In Each Scenario

In real observational data we do not know true propensities, so we estimate them. This cell fits a separate logistic propensity model in each scenario using the same observed common causes.

confounder_cols = ["user_engagement", "prior_sessions", "account_age_weeks", "is_power_user", "baseline_value"]

scored_parts = []
propensity_model_rows = []
for scenario, scenario_df in overlap_df.groupby("scenario"):
    X = scenario_df[confounder_cols]
    treatment = scenario_df["feature_exposure"]

    propensity_model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1_000, random_state=RANDOM_SEED),
    )
    propensity_model.fit(X, treatment)
    estimated_propensity = propensity_model.predict_proba(X)[:, 1]

    scenario_scored = scenario_df.copy()
    scenario_scored["estimated_propensity"] = estimated_propensity
    scenario_scored["clipped_propensity_01"] = np.clip(estimated_propensity, 0.01, 0.99)
    scenario_scored["clipped_propensity_05"] = np.clip(estimated_propensity, 0.05, 0.95)
    scored_parts.append(scenario_scored)

    propensity_model_rows.append(
        {
            "scenario": scenario,
            "propensity_auc": roc_auc_score(treatment, estimated_propensity),
            "estimated_propensity_min": estimated_propensity.min(),
            "estimated_propensity_p01": np.quantile(estimated_propensity, 0.01),
            "estimated_propensity_median": np.median(estimated_propensity),
            "estimated_propensity_p99": np.quantile(estimated_propensity, 0.99),
            "estimated_propensity_max": estimated_propensity.max(),
        }
    )

scored_df = pd.concat(scored_parts, ignore_index=True)
propensity_model_summary = pd.DataFrame(propensity_model_rows)

scored_df.to_csv(TABLE_DIR / "05_scored_overlap_dataset.csv", index=False)
propensity_model_summary.to_csv(TABLE_DIR / "05_propensity_model_summary.csv", index=False)
propensity_model_summary
scenario propensity_auc estimated_propensity_min estimated_propensity_p01 estimated_propensity_median estimated_propensity_p99 estimated_propensity_max
0 usable_overlap 0.687405 0.081002 0.155716 0.460928 0.814791 0.910762
1 weak_overlap 0.917156 0.000352 0.004665 0.752000 0.999692 0.999958

The weak-overlap scenario should show a much higher propensity-model AUC because the covariates separate treated from untreated users more cleanly. For causal weighting this is a warning sign rather than an achievement: the easier treatment is to predict, the weaker the overlap tends to be.

Plot Propensity Overlap

This plot compares treated and untreated propensity distributions in each scenario. Good overlap means both groups occupy similar regions of the propensity scale.

g = sns.displot(
    data=scored_df,
    x="estimated_propensity",
    hue="feature_exposure",
    col="scenario",
    bins=55,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    height=4.5,
    aspect=1.15,
)
g.set_axis_labels("Estimated propensity score", "Density")
g.set_titles("{col_name}")
g.fig.suptitle("Propensity Overlap By Scenario", y=1.05)
g.fig.savefig(FIGURE_DIR / "05_propensity_overlap_by_scenario.png", dpi=160, bbox_inches="tight")
plt.show()

The weak-overlap panel should show more separation. That separation is the visual warning that weighting will rely on a smaller, more fragile set of comparable observations.

Common Support Diagnostics

Common support asks whether treated and untreated users exist over the same propensity range. This cell summarizes overlap using min/max ranges and the share of observations inside simple trimming bands.

def common_support_summary(df):
    rows = []
    for scenario, scenario_df in df.groupby("scenario"):
        treated_ps = scenario_df.loc[scenario_df["feature_exposure"] == 1, "estimated_propensity"]
        control_ps = scenario_df.loc[scenario_df["feature_exposure"] == 0, "estimated_propensity"]
        lower_support = max(treated_ps.min(), control_ps.min())
        upper_support = min(treated_ps.max(), control_ps.max())
        in_empirical_support = scenario_df["estimated_propensity"].between(lower_support, upper_support)
        rows.append(
            {
                "scenario": scenario,
                "treated_ps_min": treated_ps.min(),
                "treated_ps_max": treated_ps.max(),
                "control_ps_min": control_ps.min(),
                "control_ps_max": control_ps.max(),
                "empirical_common_support_lower": lower_support,
                "empirical_common_support_upper": upper_support,
                "share_inside_empirical_common_support": in_empirical_support.mean(),
                "share_inside_01_99_band": scenario_df["estimated_propensity"].between(0.01, 0.99).mean(),
                "share_inside_05_95_band": scenario_df["estimated_propensity"].between(0.05, 0.95).mean(),
                "share_inside_10_90_band": scenario_df["estimated_propensity"].between(0.10, 0.90).mean(),
            }
        )
    return pd.DataFrame(rows)

support_summary = common_support_summary(scored_df)
support_summary.to_csv(TABLE_DIR / "05_common_support_summary.csv", index=False)
support_summary
scenario treated_ps_min treated_ps_max control_ps_min control_ps_max empirical_common_support_lower empirical_common_support_upper share_inside_empirical_common_support share_inside_01_99_band share_inside_05_95_band share_inside_10_90_band
0 usable_overlap 0.095802 0.910762 0.081002 0.872378 0.095802 0.872378 0.9980 1.000 1.000 0.9986
1 weak_overlap 0.023944 0.999958 0.000352 0.999121 0.023944 0.999121 0.9312 0.861 0.652 0.5194

The trimming-band shares show how much data would remain if we restricted analysis to less extreme propensity regions. Trimming improves stability, but it changes the target population.

Compute Weights And Effective Sample Size

This cell computes several weight variants:

  • Plain inverse propensity weights.
  • Stabilized weights, which multiply by marginal treatment probabilities.
  • Clipped weights, using propensities clipped to [0.01, 0.99] and [0.05, 0.95].

It also computes effective sample size, which falls when weights concentrate on a few units.

def add_weight_columns(df):
    weighted_parts = []
    for scenario, scenario_df in df.groupby("scenario"):
        part = scenario_df.copy()
        treatment = part["feature_exposure"].to_numpy()
        p_treated = treatment.mean()

        for propensity_col, suffix in [
            ("estimated_propensity", "raw"),
            ("clipped_propensity_01", "clip01"),
            ("clipped_propensity_05", "clip05"),
        ]:
            propensity = part[propensity_col].to_numpy()
            part[f"ipw_weight_{suffix}"] = treatment / propensity + (1 - treatment) / (1 - propensity)
            part[f"stabilized_weight_{suffix}"] = (
                treatment * p_treated / propensity
                + (1 - treatment) * (1 - p_treated) / (1 - propensity)
            )
        weighted_parts.append(part)
    return pd.concat(weighted_parts, ignore_index=True)

def effective_sample_size(weights):
    weights = np.asarray(weights)
    return weights.sum() ** 2 / np.sum(weights ** 2)

weighted_df = add_weight_columns(scored_df)

weight_rows = []
for scenario, scenario_df in weighted_df.groupby("scenario"):
    for weight_col in [
        "ipw_weight_raw",
        "ipw_weight_clip01",
        "ipw_weight_clip05",
        "stabilized_weight_raw",
        "stabilized_weight_clip01",
        "stabilized_weight_clip05",
    ]:
        weights = scenario_df[weight_col]
        ess = effective_sample_size(weights)
        weight_rows.append(
            {
                "scenario": scenario,
                "weight_column": weight_col,
                "mean_weight": weights.mean(),
                "max_weight": weights.max(),
                "p95_weight": weights.quantile(0.95),
                "p99_weight": weights.quantile(0.99),
                "effective_sample_size": ess,
                "nominal_sample_size": len(weights),
                "ess_share_of_nominal": ess / len(weights),
            }
        )

weight_diagnostics = pd.DataFrame(weight_rows)
weighted_df.to_csv(TABLE_DIR / "05_weighted_overlap_dataset.csv", index=False)
weight_diagnostics.to_csv(TABLE_DIR / "05_weight_diagnostics.csv", index=False)
weight_diagnostics
scenario weight_column mean_weight max_weight p95_weight p99_weight effective_sample_size nominal_sample_size ess_share_of_nominal
0 usable_overlap ipw_weight_raw 2.000951 10.438142 3.501863 4.870062 4328.470395 5000 0.865694
1 usable_overlap ipw_weight_clip01 2.000951 10.438142 3.501863 4.870062 4328.470395 5000 0.865694
2 usable_overlap ipw_weight_clip05 2.000951 10.438142 3.501863 4.870062 4328.470395 5000 0.865694
3 usable_overlap stabilized_weight_raw 1.000354 4.882963 1.725269 2.411711 4357.480619 5000 0.871496
4 usable_overlap stabilized_weight_clip01 1.000354 4.882963 1.725269 2.411711 4357.480619 5000 0.871496
5 usable_overlap stabilized_weight_clip05 1.000354 4.882963 1.725269 2.411711 4357.480619 5000 0.871496
6 weak_overlap ipw_weight_raw 2.093583 1138.156700 4.347776 11.870133 79.118850 5000 0.015824
7 weak_overlap ipw_weight_clip01 1.876280 100.000000 4.347776 11.870133 1019.649847 5000 0.203930
8 weak_overlap ipw_weight_clip05 1.771379 20.000000 4.347776 11.870133 2108.846973 5000 0.421769
9 weak_overlap stabilized_weight_raw 1.026361 425.670606 2.179343 6.013635 132.610950 5000 0.026522
10 weak_overlap stabilized_weight_clip01 0.945272 37.400000 2.179343 6.013635 1318.472419 5000 0.263694
11 weak_overlap stabilized_weight_clip05 0.903971 12.520000 2.179343 6.013635 2301.196033 5000 0.460239

The weak-overlap scenario should have larger maximum weights and a smaller effective sample size. A nominal sample of thousands can behave like a much smaller sample if a few units receive huge weights.
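
To see how sharply weight concentration bites, here is a small illustration with made-up weights, using the same Kish formula as the effective_sample_size helper above.

import numpy as np

def kish_ess(weights):
    # Same formula as effective_sample_size above: (sum of weights)^2 / sum of squared weights.
    weights = np.asarray(weights)
    return weights.sum() ** 2 / np.sum(weights ** 2)

print(kish_ess(np.ones(100)))          # 100.0: equal weights keep the full sample
print(kish_ess([1.0] * 99 + [100.0]))  # about 3.9: one huge weight collapses the ESS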

Plot Weight Distributions

Weights are easier to diagnose on a log scale because the right tail is what usually causes trouble.

weight_plot_df = weighted_df[["scenario", "feature_exposure", "ipw_weight_raw", "ipw_weight_clip01", "ipw_weight_clip05"]].melt(
    id_vars=["scenario", "feature_exposure"],
    var_name="weight_type",
    value_name="weight",
)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
for ax, scenario in zip(axes, ["usable_overlap", "weak_overlap"]):
    scenario_plot = weight_plot_df.query("scenario == @scenario and weight_type in ['ipw_weight_raw', 'ipw_weight_clip01']")
    sns.histplot(
        data=scenario_plot,
        x="weight",
        hue="weight_type",
        bins=70,
        stat="density",
        common_norm=False,
        element="step",
        fill=False,
        ax=ax,
    )
    ax.set_xscale("log")
    ax.set_title(scenario)
    ax.set_xlabel("IPW weight, log scale")
    ax.set_ylabel("Density")

plt.tight_layout()
fig.savefig(FIGURE_DIR / "05_weight_distributions.png", dpi=160, bbox_inches="tight")
plt.show()

The weak-overlap scenario should have a heavier right tail. Those high-weight observations are the units that make the weighting estimator more fragile.

Balance Before And After Weighting

A good weighting model should reduce imbalance in observed covariates. This cell compares raw balance to IPW-weighted balance using standardized mean differences.

def weighted_mean(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    return np.sum(values * weights) / np.sum(weights)

def weighted_var(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    mean = weighted_mean(values, weights)
    return np.sum(weights * (values - mean) ** 2) / np.sum(weights)

def standardized_mean_difference(df, column, weights=None):
    treatment = df["feature_exposure"].to_numpy() == 1
    values = df[column].to_numpy()
    if weights is None:
        treated_values = values[treatment]
        control_values = values[~treatment]
        pooled_sd = np.sqrt((treated_values.var(ddof=1) + control_values.var(ddof=1)) / 2)
        return (treated_values.mean() - control_values.mean()) / pooled_sd

    weights = np.asarray(weights)
    treated_mean = weighted_mean(values[treatment], weights[treatment])
    control_mean = weighted_mean(values[~treatment], weights[~treatment])
    treated_var = weighted_var(values[treatment], weights[treatment])
    control_var = weighted_var(values[~treatment], weights[~treatment])
    pooled_sd = np.sqrt((treated_var + control_var) / 2)
    return (treated_mean - control_mean) / pooled_sd

confounder_cols = ["user_engagement", "prior_sessions", "account_age_weeks", "is_power_user", "baseline_value"]

balance_rows = []
for scenario, scenario_df in weighted_df.groupby("scenario"):
    for covariate in confounder_cols:
        balance_rows.append(
            {
                "scenario": scenario,
                "covariate": covariate,
                "raw_smd": standardized_mean_difference(scenario_df, covariate),
                "weighted_smd_clip01": standardized_mean_difference(
                    scenario_df,
                    covariate,
                    weights=scenario_df["ipw_weight_clip01"].to_numpy(),
                ),
                "weighted_smd_clip05": standardized_mean_difference(
                    scenario_df,
                    covariate,
                    weights=scenario_df["ipw_weight_clip05"].to_numpy(),
                ),
            }
        )

balance_table = pd.DataFrame(balance_rows)
balance_table.to_csv(TABLE_DIR / "05_balance_before_after_weighting.csv", index=False)
balance_table
scenario covariate raw_smd weighted_smd_clip01 weighted_smd_clip05
0 usable_overlap user_engagement 0.646501 -0.006006 -0.006006
1 usable_overlap prior_sessions 0.264855 0.005227 0.005227
2 usable_overlap account_age_weeks -0.027640 -0.002722 -0.002722
3 usable_overlap is_power_user 0.389441 -0.000155 -0.000155
4 usable_overlap baseline_value 0.556563 -0.001766 -0.001766
5 weak_overlap user_engagement 1.737636 0.263999 0.504594
6 weak_overlap prior_sessions 0.502664 0.077280 0.148370
7 weak_overlap account_age_weeks -0.050817 0.001329 0.013894
8 weak_overlap is_power_user 0.881697 0.161302 0.296004
9 weak_overlap baseline_value 1.433895 0.152070 0.369736

Weighted SMDs should move toward zero if the propensity model is balancing observed covariates. Balance can improve even when weights are unstable, so balance and weight diagnostics should be read together.
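
One way to read the two diagnostics together is to put the worst remaining weighted imbalance next to the effective-sample-size share for the same weights. This sketch assumes balance_table and weight_diagnostics from the cells above are still in memory.

import pandas as pd

worst_balance = (
    balance_table.groupby("scenario")["weighted_smd_clip01"]
    .agg(lambda s: s.abs().max())
    .rename("worst_weighted_abs_smd")
)
ess_share = (
    weight_diagnostics.query("weight_column == 'ipw_weight_clip01'")
    .set_index("scenario")["ess_share_of_nominal"]
)
# Good balance combined with a collapsing ESS share is still a warning sign.
print(pd.concat([worst_balance, ess_share], axis=1))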

Plot Balance Diagnostics

This plot compares raw and weighted standardized mean differences across scenarios.

balance_plot_df = balance_table.melt(
    id_vars=["scenario", "covariate"],
    value_vars=["raw_smd", "weighted_smd_clip01", "weighted_smd_clip05"],
    var_name="balance_type",
    value_name="smd",
)

g = sns.catplot(
    data=balance_plot_df,
    x="smd",
    y="covariate",
    hue="balance_type",
    col="scenario",
    kind="bar",
    height=4.8,
    aspect=1.15,
    sharex=True,
)
for ax in g.axes.flat:
    ax.axvline(0, color="#111827", linewidth=1)
    ax.axvline(0.1, color="#9ca3af", linestyle="--", linewidth=1)
    ax.axvline(-0.1, color="#9ca3af", linestyle="--", linewidth=1)
g.set_axis_labels("Standardized mean difference", "")
g.set_titles("{col_name}")
g.fig.suptitle("Raw And Weighted Balance", y=1.05)
g.fig.savefig(FIGURE_DIR / "05_balance_before_after_weighting.png", dpi=160, bbox_inches="tight")
plt.show()

The weighted bars should shrink compared with the raw bars. If weighting balances covariates but the effective sample size collapses, the estimate may still be too fragile to trust without qualification.

Weighting Estimates Across Scenarios

Now we compute treatment-effect estimates for each scenario. The table includes:

  • Naive treated-minus-control difference.
  • Adjusted outcome regression.
  • IPW using clipped propensities.
  • Normalized IPW.
  • Stabilized-weight outcome mean difference.
  • Trimmed normalized IPW restricted to common propensity bands.

def estimate_weighted_effects(df, true_ate):
    rows = []
    for scenario, scenario_df in df.groupby("scenario"):
        treatment = scenario_df["feature_exposure"].to_numpy()
        outcome = scenario_df["weekly_value"].to_numpy()
        p01 = scenario_df["clipped_propensity_01"].to_numpy()
        p05 = scenario_df["clipped_propensity_05"].to_numpy()

        treated_mean = scenario_df.loc[scenario_df["feature_exposure"] == 1, "weekly_value"].mean()
        control_mean = scenario_df.loc[scenario_df["feature_exposure"] == 0, "weekly_value"].mean()
        naive = treated_mean - control_mean

        regression_fit = smf.ols(
            formula="weekly_value ~ feature_exposure + user_engagement + prior_sessions + account_age_weeks + is_power_user + baseline_value",
            data=scenario_df,
        ).fit()
        regression_estimate = regression_fit.params["feature_exposure"]

        ipw_clip01 = np.mean(treatment * outcome / p01 - (1 - treatment) * outcome / (1 - p01))
        normalized_ipw_clip01 = (
            np.sum(treatment * outcome / p01) / np.sum(treatment / p01)
            - np.sum((1 - treatment) * outcome / (1 - p01)) / np.sum((1 - treatment) / (1 - p01))
        )

        ipw_clip05 = np.mean(treatment * outcome / p05 - (1 - treatment) * outcome / (1 - p05))
        normalized_ipw_clip05 = (
            np.sum(treatment * outcome / p05) / np.sum(treatment / p05)
            - np.sum((1 - treatment) * outcome / (1 - p05)) / np.sum((1 - treatment) / (1 - p05))
        )

        p_treated = treatment.mean()
        stabilized_weights = treatment * p_treated / p01 + (1 - treatment) * (1 - p_treated) / (1 - p01)
        stabilized_treated_mean = weighted_mean(outcome[treatment == 1], stabilized_weights[treatment == 1])
        stabilized_control_mean = weighted_mean(outcome[treatment == 0], stabilized_weights[treatment == 0])
        stabilized_difference = stabilized_treated_mean - stabilized_control_mean

        trim_mask_05_95 = scenario_df["estimated_propensity"].between(0.05, 0.95).to_numpy()
        trim_mask_10_90 = scenario_df["estimated_propensity"].between(0.10, 0.90).to_numpy()

        for estimator, estimate, rows_used in [
            ("naive_difference", naive, len(scenario_df)),
            ("adjusted_outcome_regression", regression_estimate, len(scenario_df)),
            ("ipw_clip01", ipw_clip01, len(scenario_df)),
            ("normalized_ipw_clip01", normalized_ipw_clip01, len(scenario_df)),
            ("ipw_clip05", ipw_clip05, len(scenario_df)),
            ("normalized_ipw_clip05", normalized_ipw_clip05, len(scenario_df)),
            ("stabilized_weight_difference_clip01", stabilized_difference, len(scenario_df)),
        ]:
            rows.append(
                {
                    "scenario": scenario,
                    "estimator": estimator,
                    "estimate": estimate,
                    "rows_used": rows_used,
                    "share_rows_used": rows_used / len(scenario_df),
                    "absolute_error_vs_true_ate": abs(estimate - true_ate),
                }
            )

        for trim_name, trim_mask in [("trimmed_05_95", trim_mask_05_95), ("trimmed_10_90", trim_mask_10_90)]:
            trimmed = scenario_df.loc[trim_mask].copy()
            tt = trimmed["feature_exposure"].to_numpy()
            yy = trimmed["weekly_value"].to_numpy()
            pp = trimmed["clipped_propensity_01"].to_numpy()
            trimmed_normalized_ipw = (
                np.sum(tt * yy / pp) / np.sum(tt / pp)
                - np.sum((1 - tt) * yy / (1 - pp)) / np.sum((1 - tt) / (1 - pp))
            )
            rows.append(
                {
                    "scenario": scenario,
                    "estimator": f"normalized_ipw_{trim_name}",
                    "estimate": trimmed_normalized_ipw,
                    "rows_used": len(trimmed),
                    "share_rows_used": len(trimmed) / len(scenario_df),
                    "absolute_error_vs_true_ate": abs(trimmed_normalized_ipw - true_ate),
                }
            )
    return pd.DataFrame(rows)

weighted_estimates = estimate_weighted_effects(weighted_df, TRUE_ATE)
weighted_estimates.to_csv(TABLE_DIR / "05_weighted_estimates.csv", index=False)
weighted_estimates
scenario estimator estimate rows_used share_rows_used absolute_error_vs_true_ate
0 usable_overlap naive_difference 2.994493 5000 1.0000 1.394493
1 usable_overlap adjusted_outcome_regression 1.620507 5000 1.0000 0.020507
2 usable_overlap ipw_clip01 1.639285 5000 1.0000 0.039285
3 usable_overlap normalized_ipw_clip01 1.612694 5000 1.0000 0.012694
4 usable_overlap ipw_clip05 1.639285 5000 1.0000 0.039285
5 usable_overlap normalized_ipw_clip05 1.612694 5000 1.0000 0.012694
6 usable_overlap stabilized_weight_difference_clip01 1.612694 5000 1.0000 0.012694
7 usable_overlap normalized_ipw_trimmed_05_95 1.612694 5000 1.0000 0.012694
8 usable_overlap normalized_ipw_trimmed_10_90 1.615375 4993 0.9986 0.015375
9 weak_overlap naive_difference 4.567569 5000 1.0000 2.967569
10 weak_overlap adjusted_outcome_regression 1.683087 5000 1.0000 0.083087
11 weak_overlap ipw_clip01 2.427237 5000 1.0000 0.827237
12 weak_overlap normalized_ipw_clip01 2.170689 5000 1.0000 0.570689
13 weak_overlap ipw_clip05 3.323897 5000 1.0000 1.723897
14 weak_overlap normalized_ipw_clip05 2.646909 5000 1.0000 1.046909
15 weak_overlap stabilized_weight_difference_clip01 2.170689 5000 1.0000 0.570689
16 weak_overlap normalized_ipw_trimmed_05_95 1.685725 3260 0.6520 0.085725
17 weak_overlap normalized_ipw_trimmed_10_90 1.633528 2597 0.5194 0.033528

The usable-overlap estimates should cluster closer together. In the weak-overlap case, raw weighting can drift because a few units have too much influence. Trimming often stabilizes the number, but it estimates the effect for a narrower population.

Plot Weighting Estimates

This plot compares estimators by scenario. The dashed vertical line marks the known true ATE.

fig, axes = plt.subplots(1, 2, figsize=(15, 6), sharex=True)
for ax, scenario in zip(axes, ["usable_overlap", "weak_overlap"]):
    scenario_plot = weighted_estimates.query("scenario == @scenario").copy()
    sns.scatterplot(
        data=scenario_plot,
        x="estimate",
        y="estimator",
        size="share_rows_used",
        sizes=(45, 120),
        color="#2563eb",
        legend=False,
        ax=ax,
    )
    ax.axvline(TRUE_ATE, color="#111827", linestyle="--", linewidth=1.4, label="Known true ATE")
    ax.set_title(scenario)
    ax.set_xlabel("Estimated effect")
    ax.set_ylabel("")
    ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "05_weighted_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The weak-overlap panel should look less stable. The point sizes show how much data each trimmed estimator kept; trimming can reduce variance but changes the population being described.

Trimming Changes The Target Population

Trimming is not just a technical cleanup. It removes units from the tails of the propensity distribution, which often removes users with more extreme baseline profiles.

This cell summarizes how baseline covariates change after trimming.

trim_rows = []
for scenario, scenario_df in weighted_df.groupby("scenario"):
    for sample_name, mask in [
        ("full_sample", np.ones(len(scenario_df), dtype=bool)),
        ("trim_05_95", scenario_df["estimated_propensity"].between(0.05, 0.95).to_numpy()),
        ("trim_10_90", scenario_df["estimated_propensity"].between(0.10, 0.90).to_numpy()),
    ]:
        sample = scenario_df.loc[mask]
        trim_rows.append(
            {
                "scenario": scenario,
                "sample": sample_name,
                "rows": len(sample),
                "share_rows": len(sample) / len(scenario_df),
                "treatment_rate": sample["feature_exposure"].mean(),
                "mean_user_engagement": sample["user_engagement"].mean(),
                "mean_baseline_value": sample["baseline_value"].mean(),
                "mean_estimated_propensity": sample["estimated_propensity"].mean(),
            }
        )

trim_population_summary = pd.DataFrame(trim_rows)
trim_population_summary.to_csv(TABLE_DIR / "05_trimming_population_summary.csv", index=False)
trim_population_summary
scenario sample rows share_rows treatment_rate mean_user_engagement mean_baseline_value mean_estimated_propensity
0 usable_overlap full_sample 5000 1.0000 0.467800 -0.006542 2.513019 0.467882
1 usable_overlap trim_05_95 5000 1.0000 0.467800 -0.006542 2.513019 0.467882
2 usable_overlap trim_10_90 4993 0.9986 0.467855 -0.004857 2.514952 0.468083
3 weak_overlap full_sample 5000 1.0000 0.626000 0.010718 2.515658 0.626021
4 weak_overlap trim_05_95 3260 0.6520 0.551840 -0.251799 2.142311 0.552169
5 weak_overlap trim_10_90 2597 0.5194 0.535233 -0.279770 2.083192 0.534518

Trimming shifts the average baseline profile of the sample, as the weak-overlap rows above show. The trimmed estimate is often more stable, but it describes a narrower, more internally comparable subpopulation rather than the full population.

DoWhy Weighting Under Usable And Weak Overlap

Now we run DoWhy’s backdoor.propensity_score_weighting estimator under the same graph in both scenarios. This connects the manual diagnostics to DoWhy’s estimator interface.

estimator_edges = [
    ("user_engagement", "feature_exposure"),
    ("user_engagement", "weekly_value"),
    ("prior_sessions", "feature_exposure"),
    ("prior_sessions", "weekly_value"),
    ("account_age_weeks", "feature_exposure"),
    ("account_age_weeks", "weekly_value"),
    ("is_power_user", "feature_exposure"),
    ("is_power_user", "weekly_value"),
    ("baseline_value", "feature_exposure"),
    ("baseline_value", "weekly_value"),
    ("feature_exposure", "weekly_value"),
]

def edges_to_dot(edges):
    lines = ["digraph {"]
    for source, target in edges:
        lines.append(f"    {source} -> {target};")
    lines.append("}")
    return "\n".join(lines)

dowhy_graph = edges_to_dot(estimator_edges)

dowhy_rows = []
for scenario, scenario_df in overlap_df.groupby("scenario"):
    dowhy_data = scenario_df.drop(columns=["scenario", "true_propensity"]).copy()
    model = CausalModel(
        data=dowhy_data,
        treatment="feature_exposure",
        outcome="weekly_value",
        graph=dowhy_graph,
    )
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    for label, method_name in [
        ("dowhy_linear_regression", "backdoor.linear_regression"),
        ("dowhy_propensity_score_weighting", "backdoor.propensity_score_weighting"),
    ]:
        estimate = model.estimate_effect(estimand, method_name=method_name)
        dowhy_rows.append(
            {
                "scenario": scenario,
                "estimator": label,
                "method_name": method_name,
                "reported_common_causes": ", ".join(model.get_common_causes()),
                "estimate": float(estimate.value),
                "absolute_error_vs_true_ate": abs(float(estimate.value) - TRUE_ATE),
            }
        )

dowhy_weighting_comparison = pd.DataFrame(dowhy_rows)
dowhy_weighting_comparison.to_csv(TABLE_DIR / "05_dowhy_weighting_comparison.csv", index=False)
dowhy_weighting_comparison
scenario estimator method_name reported_common_causes estimate absolute_error_vs_true_ate
0 usable_overlap dowhy_linear_regression backdoor.linear_regression is_power_user, baseline_value, user_engagement... 1.620507 0.020507
1 usable_overlap dowhy_propensity_score_weighting backdoor.propensity_score_weighting is_power_user, baseline_value, user_engagement... 1.613681 0.013681
2 weak_overlap dowhy_linear_regression backdoor.linear_regression is_power_user, baseline_value, user_engagement... 1.683087 0.083087
3 weak_overlap dowhy_propensity_score_weighting backdoor.propensity_score_weighting is_power_user, baseline_value, user_engagement... 2.646674 1.046674

DoWhy uses the same graph in both scenarios. If the weighting estimate is less stable in the weak-overlap scenario, the issue is not graph identification; it is support and weighting fragility.

Plot DoWhy Estimates Against Manual Diagnostics

This plot shows DoWhy’s regression and weighting estimates next to the known ATE.

fig, ax = plt.subplots(figsize=(11, 5))
sns.scatterplot(
    data=dowhy_weighting_comparison,
    x="estimate",
    y="scenario",
    hue="estimator",
    s=95,
    ax=ax,
)
ax.axvline(TRUE_ATE, color="#111827", linestyle="--", linewidth=1.4, label="Known true ATE")
ax.set_title("DoWhy Regression And Weighting Estimates By Overlap Scenario")
ax.set_xlabel("Estimated effect")
ax.set_ylabel("")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "05_dowhy_weighting_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The DoWhy weighting estimate can drift in weak overlap for the same reason manual IPW drifts: extreme propensities create extreme influence. A clean API call does not remove the need for overlap diagnostics.
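
As a quick cross-check, this sketch lines DoWhy's weighting estimate up against the manual normalized IPW from earlier; it assumes weighted_estimates and dowhy_weighting_comparison are still in memory.

import pandas as pd

manual_ipw = (
    weighted_estimates.query("estimator == 'normalized_ipw_clip01'")
    .set_index("scenario")["estimate"]
    .rename("manual_normalized_ipw_clip01")
)
dowhy_psw = (
    dowhy_weighting_comparison.query("estimator == 'dowhy_propensity_score_weighting'")
    .set_index("scenario")["estimate"]
    .rename("dowhy_weighting")
)
# Both estimators should drift from the truth together when overlap is weak.
print(pd.concat([manual_ipw, dowhy_psw], axis=1).assign(true_ate=TRUE_ATE))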

Practical Decision Guide

This table summarizes what to do when weighting diagnostics look good, borderline, or poor.

decision_guide = pd.DataFrame(
    [
        {
            "diagnostic_pattern": "Good overlap, small weights, improved balance",
            "reasonable_next_step": "Report IPW or normalized IPW alongside regression and balance diagnostics.",
            "caution": "Still depends on observed-confounding assumptions.",
        },
        {
            "diagnostic_pattern": "Moderate tails but acceptable effective sample size",
            "reasonable_next_step": "Compare clipped, stabilized, and normalized weights; report sensitivity to trimming.",
            "caution": "Make clear whether trimming changes the target population.",
        },
        {
            "diagnostic_pattern": "Extreme weights and small effective sample size",
            "reasonable_next_step": "Avoid relying on raw IPW alone; consider trimming, overlap weights, redesign, or narrower estimand.",
            "caution": "The full-population ATE may not be well supported by observed data.",
        },
        {
            "diagnostic_pattern": "Balance remains poor after weighting",
            "reasonable_next_step": "Revisit the propensity model and graph assumptions before interpreting the estimate.",
            "caution": "A weighted estimate without balance is not reassuring.",
        },
    ]
)

decision_guide.to_csv(TABLE_DIR / "05_weighting_decision_guide.csv", index=False)
decision_guide
diagnostic_pattern reasonable_next_step caution
0 Good overlap, small weights, improved balance Report IPW or normalized IPW alongside regress... Still depends on observed-confounding assumpti...
1 Moderate tails but acceptable effective sample... Compare clipped, stabilized, and normalized we... Make clear whether trimming changes the target...
2 Extreme weights and small effective sample size Avoid relying on raw IPW alone; consider trimm... The full-population ATE may not be well suppor...
3 Balance remains poor after weighting Revisit the propensity model and graph assumpt... A weighted estimate without balance is not rea...

This is the practical mindset: weighting is only credible when the weighted comparison is both balanced and supported by enough observations.
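
The decision guide mentions overlap weights as one fallback for poor support. They are easy to sketch with the columns already in weighted_df, but note that the estimand changes: overlap weights target the average effect in the overlap population (ATO), not the full-population ATE.

import numpy as np

# Overlap (ATO) weights: treated units get 1 - e, untreated units get e,
# which smoothly downweights both poorly supported propensity tails.
for scenario, scenario_df in weighted_df.groupby("scenario"):
    t = scenario_df["feature_exposure"].to_numpy()
    y = scenario_df["weekly_value"].to_numpy()
    e = scenario_df["estimated_propensity"].to_numpy()
    w = t * (1 - e) + (1 - t) * e
    ato = np.sum(w * t * y) / np.sum(w * t) - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
    print(f"{scenario}: overlap-weighted (ATO) estimate = {ato:.3f}")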

Final Summary

This final table gives a compact report-ready summary of the notebook’s lessons.

usable_norm = weighted_estimates.query("scenario == 'usable_overlap' and estimator == 'normalized_ipw_clip01'")["estimate"].iloc[0]
weak_norm = weighted_estimates.query("scenario == 'weak_overlap' and estimator == 'normalized_ipw_clip01'")["estimate"].iloc[0]
usable_ess = weight_diagnostics.query("scenario == 'usable_overlap' and weight_column == 'ipw_weight_clip01'")["effective_sample_size"].iloc[0]
weak_ess = weight_diagnostics.query("scenario == 'weak_overlap' and weight_column == 'ipw_weight_clip01'")["effective_sample_size"].iloc[0]

final_summary = pd.DataFrame(
    [
        {"item": "Causal question", "summary": "Average effect of feature exposure on weekly value."},
        {"item": "Known true ATE", "summary": f"{TRUE_ATE:.3f}"},
        {"item": "Usable-overlap normalized IPW", "summary": f"{usable_norm:.3f}; effective sample size about {usable_ess:.0f}."},
        {"item": "Weak-overlap normalized IPW", "summary": f"{weak_norm:.3f}; effective sample size about {weak_ess:.0f}."},
        {"item": "Main diagnostic lesson", "summary": "Weak overlap creates large weights and makes weighting estimates more fragile."},
        {"item": "Trimming lesson", "summary": "Trimming can stabilize estimates but changes the population being described."},
        {"item": "DoWhy lesson", "summary": "DoWhy can estimate weighted effects, but overlap diagnostics remain the analyst's responsibility."},
        {"item": "Main limitation", "summary": "All weighting estimators still depend on measured common causes and adequate support."},
    ]
)

final_summary.to_csv(TABLE_DIR / "05_final_weighting_summary.csv", index=False)
final_summary
item summary
0 Causal question Average effect of feature exposure on weekly v...
1 Known true ATE 1.600
2 Usable-overlap normalized IPW 1.613; effective sample size about 4328.
3 Weak-overlap normalized IPW 2.171; effective sample size about 1020.
4 Main diagnostic lesson Weak overlap creates large weights and makes w...
5 Trimming lesson Trimming can stabilize estimates but changes t...
6 DoWhy lesson DoWhy can estimate weighted effects, but overl...
7 Main limitation All weighting estimators still depend on measu...

The key point is not that weighting is good or bad. Weighting is useful when the data support the comparison. When overlap is weak, the responsible answer may be a narrower estimand, a trimmed population, or a redesign of the analysis.

Student Exercises

Try these after running the notebook:

  1. Increase treatment_selection_strength in the weak-overlap dataset and watch the effective sample size fall (a starter sketch follows this list).
  2. Change the clipping thresholds from [0.01, 0.99] to [0.02, 0.98] and compare estimates.
  3. Compare trimming bands [0.05, 0.95], [0.10, 0.90], and [0.20, 0.80].
  4. Add a nonlinear term to the propensity model and see whether balance improves.
  5. Remove one confounder from the propensity model and inspect both balance and bias.
  6. Write a short stakeholder summary explaining why a full-population ATE may not be supported under weak overlap.
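
For exercise 1, a hedged starter sketch might look like this; it reuses make_overlap_data, confounder_cols, RANDOM_SEED, and effective_sample_size from earlier cells.

for strength in [0.55, 1.10, 2.20, 3.30]:
    df, _ = make_overlap_data(treatment_selection_strength=strength, seed=55)
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1_000, random_state=RANDOM_SEED),
    )
    model.fit(df[confounder_cols], df["feature_exposure"])
    # A small clip guards against division blow-ups at numerically extreme propensities.
    e = np.clip(model.predict_proba(df[confounder_cols])[:, 1], 1e-6, 1 - 1e-6)
    t = df["feature_exposure"].to_numpy()
    w = t / e + (1 - t) / (1 - e)
    print(f"strength={strength:.2f}: ESS = {effective_sample_size(w):,.0f} of {len(df):,}")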

Closing Notes

This notebook showed that inverse propensity weighting is not just a formula. It requires overlap, reasonable weights, adequate effective sample size, and improved covariate balance. The next notebook will move beyond backdoor weighting and introduce frontdoor, instrumental-variable, and natural-experiment logic.