DoWhy Tutorial 04: Regression, Matching, And Propensity Estimators

This notebook compares several estimators for the same identified causal estimand. The goal is not to crown one estimator as universally best. The goal is to understand what each estimator is doing, what assumptions it leans on, and why estimates can differ even when the causal graph is unchanged.

We will use one observational teaching dataset, one causal graph, and one target estimand: the average total effect of feature_exposure on weekly_value. Then we will estimate that target with outcome regression, propensity-score matching, propensity-score stratification, and propensity-score weighting.

Learning Goals

By the end of this notebook, you should be able to:

  • Explain the difference between an estimand and an estimator.
  • Run manual versions of regression, matching, stratification, and weighting estimators.
  • Run the corresponding DoWhy estimators for a common backdoor estimand.
  • Diagnose basic estimator behavior using propensity overlap, covariate balance, and estimate stability.
  • Explain why estimator agreement is useful but does not prove the causal graph is correct.

Estimand Versus Estimator

The estimand is the causal quantity implied by the graph and assumptions. In this notebook, the estimand is the average treatment effect after adjusting for observed common causes.

The estimator is the statistical method used to compute that estimand from finite data. Regression, matching, stratification, and weighting can all target the same estimand, but they do so with different statistical machinery.

Setup

This setup cell imports the packages used in the notebook, creates output folders, fixes a random seed, and suppresses known third-party compatibility warnings. The warning policy keeps expected library chatter out of the student-facing notebook while preserving real execution errors.

from pathlib import Path
import os
import platform
import sys
import warnings

START_DIR = Path.cwd().resolve()
PROJECT_ROOT = next(
    (candidate for candidate in [START_DIR, *START_DIR.parents] if (candidate / "pyproject.toml").exists()),
    START_DIR,
)

NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "dowhy"
OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
CACHE_DIR = PROJECT_ROOT / ".cache" / "matplotlib"

for directory in [OUTPUT_DIR, FIGURE_DIR, TABLE_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(CACHE_DIR))

warnings.filterwarnings("default")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=PendingDeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*IProgress not found.*")
warnings.filterwarnings("ignore", message=".*setParseAction.*deprecated.*")
warnings.filterwarnings("ignore", message=".*copy keyword is deprecated.*")
warnings.filterwarnings("ignore", message=".*disp.*iprint.*L-BFGS-B.*")
warnings.filterwarnings("ignore", module="dowhy.causal_estimators.regression_estimator")
warnings.filterwarnings("ignore", module="sklearn.linear_model._logistic")
warnings.filterwarnings("ignore", module="seaborn.categorical")
warnings.filterwarnings("ignore", module="pydot.dot_parser")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import statsmodels.formula.api as smf

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import dowhy
from dowhy import CausalModel

RANDOM_SEED = 41
rng = np.random.default_rng(RANDOM_SEED)

sns.set_theme(style="whitegrid", context="notebook")

print(f"Python executable: {sys.executable}")
print(f"Python version: {platform.python_version()}")
print(f"DoWhy version: {getattr(dowhy, '__version__', 'unknown')}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
Python executable: /home/apex/Documents/ranking_sys/.venv/bin/python3
Python version: 3.13.12
DoWhy version: 0.14
Notebook directory: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy
Output directory: /home/apex/Documents/ranking_sys/notebooks/tutorials/dowhy/outputs

The notebook is ready if this cell prints a DoWhy version. All generated artifacts from this notebook use a 04_ prefix.

Estimator Map

Before creating data, it helps to name the estimator families. This table gives a student-friendly summary of what each method is trying to do.

estimator_map = pd.DataFrame(
    [
        {
            "estimator_family": "Outcome regression",
            "basic_idea": "Model the outcome as a function of treatment and confounders.",
            "main_strength": "Simple and interpretable when the outcome model is credible.",
            "main_failure_mode": "Can be biased if the outcome model is misspecified.",
        },
        {
            "estimator_family": "Propensity-score matching",
            "basic_idea": "Compare treated and untreated units with similar treatment probabilities.",
            "main_strength": "Makes treated-control comparability intuitive.",
            "main_failure_mode": "Can be noisy or biased when matches are poor or overlap is weak.",
        },
        {
            "estimator_family": "Propensity-score stratification",
            "basic_idea": "Compare treated and untreated units within propensity-score strata.",
            "main_strength": "Easy to inspect and explain.",
            "main_failure_mode": "Sensitive to bin choices and sparse cells.",
        },
        {
            "estimator_family": "Propensity-score weighting",
            "basic_idea": "Reweight units by inverse treatment probability to reduce observed confounding.",
            "main_strength": "Targets population-level effects directly when propensities are reliable.",
            "main_failure_mode": "Extreme weights can create high variance and fragile estimates.",
        },
    ]
)

estimator_map.to_csv(TABLE_DIR / "04_estimator_map.csv", index=False)
estimator_map
estimator_family basic_idea main_strength main_failure_mode
0 Outcome regression Model the outcome as a function of treatment a... Simple and interpretable when the outcome mode... Can be biased if the outcome model is misspeci...
1 Propensity-score matching Compare treated and untreated units with simil... Makes treated-control comparability intuitive. Can be noisy or biased when matches are poor o...
2 Propensity-score stratification Compare treated and untreated units within pro... Easy to inspect and explain. Sensitive to bin choices and sparse cells.
3 Propensity-score weighting Reweight units by inverse treatment probabilit... Targets population-level effects directly when... Extreme weights can create high variance and f...

All four methods still rely on the same causal assumption: after adjusting for observed common causes, treatment is as-if randomized. The estimator does not rescue a wrong graph.

Causal Question And Variable Roles

The teaching question is:

What is the average total effect of feature_exposure on weekly_value?

Aside from the treatment, the outcome, and the simulation-only treatment probability, every variable below is a pre-treatment common cause.

role_table = pd.DataFrame(
    [
        {"variable": "feature_exposure", "role": "treatment", "timing": "treatment time", "used_for_adjustment": "no; this is the treatment"},
        {"variable": "weekly_value", "role": "outcome", "timing": "future outcome window", "used_for_adjustment": "no; this is the outcome"},
        {"variable": "user_engagement", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "prior_sessions", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "account_age_weeks", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "is_power_user", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "baseline_value", "role": "observed common cause", "timing": "pre-treatment", "used_for_adjustment": "yes"},
        {"variable": "treatment_probability", "role": "simulation diagnostic", "timing": "known only because this is simulated", "used_for_adjustment": "no; not a real observed column"},
    ]
)

role_table.to_csv(TABLE_DIR / "04_variable_roles.csv", index=False)
role_table
variable role timing used_for_adjustment
0 feature_exposure treatment treatment time no; this is the treatment
1 weekly_value outcome future outcome window no; this is the outcome
2 user_engagement observed common cause pre-treatment yes
3 prior_sessions observed common cause pre-treatment yes
4 account_age_weeks observed common cause pre-treatment yes
5 is_power_user observed common cause pre-treatment yes
6 baseline_value observed common cause pre-treatment yes
7 treatment_probability simulation diagnostic known only because this is simulated no; not a real observed column

The adjustment set is deliberately straightforward: all adjustment variables are pre-treatment common causes. This lets the notebook focus on estimator mechanics rather than bad-control mistakes.

Create A Teaching Dataset

This dataset is observational: treatment assignment depends on baseline variables. The true effect is constant and known because this is a simulation. That gives us a benchmark for comparing estimators.

def make_estimator_teaching_data(n=4_500, seed=41):
    local_rng = np.random.default_rng(seed)

    user_engagement = local_rng.normal(loc=0.0, scale=1.0, size=n)
    prior_sessions = np.clip(
        local_rng.poisson(lam=np.exp(1.0 + 0.23 * user_engagement), size=n),
        0,
        30,
    )
    account_age_weeks = local_rng.gamma(shape=2.2, scale=3.5, size=n)
    is_power_user = local_rng.binomial(
        n=1,
        p=1 / (1 + np.exp(-(0.70 * user_engagement - 0.20))),
        size=n,
    )
    baseline_value = (
        2.0
        + 1.00 * user_engagement
        + 0.050 * prior_sessions
        + 0.020 * account_age_weeks
        + 0.550 * is_power_user
        + local_rng.normal(loc=0.0, scale=1.0, size=n)
    )

    treatment_logit = (
        -0.65
        + 0.80 * user_engagement
        + 0.040 * prior_sessions
        + 0.200 * baseline_value
        + 0.450 * is_power_user
        - 0.015 * account_age_weeks
    )
    treatment_probability = 1 / (1 + np.exp(-treatment_logit))
    feature_exposure = local_rng.binomial(n=1, p=treatment_probability, size=n)

    true_ate = 1.65
    weekly_value = (
        4.0
        + true_ate * feature_exposure
        + 1.15 * user_engagement
        + 0.055 * prior_sessions
        + 0.030 * account_age_weeks
        + 0.550 * baseline_value
        + 0.450 * is_power_user
        + local_rng.normal(loc=0.0, scale=1.15, size=n)
    )

    df = pd.DataFrame(
        {
            "feature_exposure": feature_exposure,
            "weekly_value": weekly_value,
            "user_engagement": user_engagement,
            "prior_sessions": prior_sessions,
            "account_age_weeks": account_age_weeks,
            "is_power_user": is_power_user,
            "baseline_value": baseline_value,
            "treatment_probability": treatment_probability,
        }
    )
    return df, true_ate

estimator_df, TRUE_ATE = make_estimator_teaching_data()
estimator_df.to_csv(TABLE_DIR / "04_estimator_teaching_dataset.csv", index=False)

print(f"Rows: {len(estimator_df):,}")
print(f"Known true ATE: {TRUE_ATE:.4f}")
estimator_df.head()
Rows: 4,500
Known true ATE: 1.6500
feature_exposure weekly_value user_engagement prior_sessions account_age_weeks is_power_user baseline_value treatment_probability
0 0 2.012963 -1.231665 2 11.291630 1 0.390388 0.232075
1 0 5.894137 0.267119 4 8.218280 0 2.583502 0.529247
2 0 5.865394 -0.006926 4 19.840561 0 1.635882 0.385570
3 1 8.297606 0.501535 3 3.476287 0 4.670994 0.679894
4 1 6.857627 -1.326728 6 15.562224 0 2.764735 0.240148

The first rows show the treatment, outcome, common causes, and known assignment probability. In real data, the true treatment probability would usually be unknown, so we will also estimate a propensity model.

Basic Dataset Checks

Before fitting estimators, inspect treatment prevalence, outcome scale, covariate ranges, and the treatment-probability range.

basic_summary = estimator_df.agg(
    {
        "feature_exposure": ["mean", "sum"],
        "weekly_value": ["mean", "std", "min", "max"],
        "user_engagement": ["mean", "std", "min", "max"],
        "prior_sessions": ["mean", "std", "min", "max"],
        "account_age_weeks": ["mean", "std", "min", "max"],
        "is_power_user": ["mean", "sum"],
        "baseline_value": ["mean", "std", "min", "max"],
        "treatment_probability": ["mean", "std", "min", "max"],
    }
).T

basic_summary.to_csv(TABLE_DIR / "04_dataset_summary.csv")
basic_summary
mean sum std min max
feature_exposure 0.511556 2302.0 NaN NaN NaN
weekly_value 6.841887 NaN 2.717787 -2.672389 15.225523
user_engagement 0.002319 NaN 0.987011 -3.678627 3.688559
prior_sessions 2.757333 NaN 1.761446 0.000000 13.000000
account_age_weeks 7.611833 NaN 5.280117 0.151001 63.187628
is_power_user 0.452667 2037.0 NaN NaN NaN
baseline_value 2.546050 NaN 1.532906 -2.954430 7.818962
treatment_probability 0.510979 NaN 0.232225 0.019354 0.985849

The treatment rate is close enough to balanced for all four estimator families to be meaningful. The treatment probabilities also reach into the tails near 0 and 1, which is useful for teaching how weighting and matching behave when comparable units become scarce.

Confirm Observed Confounding

This plot shows that treatment assignment is not random. Exposed users tend to have different baseline engagement and baseline value.

fig, axes = plt.subplots(1, 3, figsize=(16, 4.8))

sns.histplot(
    data=estimator_df,
    x="treatment_probability",
    hue="feature_exposure",
    bins=45,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    ax=axes[0],
)
axes[0].set_title("Known Treatment Probability")
axes[0].set_xlabel("Treatment probability")
axes[0].set_ylabel("Density")

sns.violinplot(data=estimator_df, x="feature_exposure", y="user_engagement", inner="quartile", cut=0, ax=axes[1])
axes[1].set_title("Engagement Imbalance")
axes[1].set_xlabel("Feature exposure")
axes[1].set_ylabel("User engagement")

sns.violinplot(data=estimator_df, x="feature_exposure", y="baseline_value", inner="quartile", cut=0, ax=axes[2])
axes[2].set_title("Baseline Value Imbalance")
axes[2].set_xlabel("Feature exposure")
axes[2].set_ylabel("Baseline value")

plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_assignment_and_confounding_checks.png", dpi=160, bbox_inches="tight")
plt.show()

The treated and untreated groups differ before treatment. That is why all estimators in this notebook target an adjusted effect rather than a raw difference in means.

Baseline Imbalance Table

Standardized mean differences summarize baseline imbalance in standard-deviation units. We will use the same metric later to compare raw and weighted balance.

confounder_cols = ["user_engagement", "prior_sessions", "account_age_weeks", "is_power_user", "baseline_value"]

def weighted_mean(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    return np.sum(values * weights) / np.sum(weights)

def weighted_var(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    mean = weighted_mean(values, weights)
    return np.sum(weights * (values - mean) ** 2) / np.sum(weights)

def standardized_mean_difference(df, column, treatment_col="feature_exposure", weights=None):
    treated_mask = df[treatment_col].to_numpy() == 1
    values = df[column].to_numpy()
    if weights is None:
        treated_values = values[treated_mask]
        control_values = values[~treated_mask]
        pooled_sd = np.sqrt((treated_values.var(ddof=1) + control_values.var(ddof=1)) / 2)
        return (treated_values.mean() - control_values.mean()) / pooled_sd

    weights = np.asarray(weights)
    treated_mean = weighted_mean(values[treated_mask], weights[treated_mask])
    control_mean = weighted_mean(values[~treated_mask], weights[~treated_mask])
    treated_var = weighted_var(values[treated_mask], weights[treated_mask])
    control_var = weighted_var(values[~treated_mask], weights[~treated_mask])
    pooled_sd = np.sqrt((treated_var + control_var) / 2)
    return (treated_mean - control_mean) / pooled_sd

raw_balance = pd.DataFrame(
    [
        {"covariate": column, "raw_smd": standardized_mean_difference(estimator_df, column)}
        for column in confounder_cols
    ]
).sort_values("raw_smd", key=lambda s: s.abs(), ascending=False)

raw_balance.to_csv(TABLE_DIR / "04_raw_covariate_balance.csv", index=False)
raw_balance
covariate raw_smd
0 user_engagement 0.973071
4 baseline_value 0.830520
3 is_power_user 0.471878
1 prior_sessions 0.390748
2 account_age_weeks -0.038578

Large absolute values mean exposed and unexposed users differ on that covariate. Matching, stratification, and weighting are all different attempts to reduce the consequences of this imbalance.
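As a quick sanity check on the SMD definition, shifting one group by exactly the pooled standard deviation should produce an SMD of 1. A minimal standalone sketch with hypothetical toy arrays, not the notebook's data:

```python
import numpy as np

# Toy illustration: a control group with a known spread.
control = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pooled_sd = np.sqrt(control.var(ddof=1))  # both groups will share this SD

# Shift the "treated" group by exactly one pooled SD.
treated = control + pooled_sd

# Same formula as the notebook's standardized_mean_difference helper.
smd = (treated.mean() - control.mean()) / np.sqrt(
    (treated.var(ddof=1) + control.var(ddof=1)) / 2
)
print(round(smd, 6))  # 1 by construction, up to floating-point rounding
```

Because the shift is a constant, both groups have identical variance, so the pooled SD cancels the mean gap exactly.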

Naive Difference And Outcome Regression

We start with two simple baselines: the naive treated-minus-control difference and an adjusted regression. The adjusted regression controls for the observed common causes named in the graph.

naive_effect = (
    estimator_df.loc[estimator_df["feature_exposure"] == 1, "weekly_value"].mean()
    - estimator_df.loc[estimator_df["feature_exposure"] == 0, "weekly_value"].mean()
)

regression_formula = (
    "weekly_value ~ feature_exposure + user_engagement + prior_sessions "
    "+ account_age_weeks + is_power_user + baseline_value"
)
regression_fit = smf.ols(formula=regression_formula, data=estimator_df).fit()
regression_effect = regression_fit.params["feature_exposure"]
regression_ci = regression_fit.conf_int().loc["feature_exposure"].to_numpy()

regression_baseline = pd.DataFrame(
    [
        {"estimator": "known_true_ate", "estimate": TRUE_ATE, "ci_95_lower": np.nan, "ci_95_upper": np.nan, "description": "Known only because this is a teaching simulation."},
        {"estimator": "naive_difference_in_means", "estimate": naive_effect, "ci_95_lower": np.nan, "ci_95_upper": np.nan, "description": "Unadjusted association; confounding is not blocked."},
        {"estimator": "adjusted_outcome_regression", "estimate": regression_effect, "ci_95_lower": regression_ci[0], "ci_95_upper": regression_ci[1], "description": "Backdoor adjustment through a linear outcome model."},
    ]
)

regression_baseline.to_csv(TABLE_DIR / "04_regression_baseline_estimates.csv", index=False)
regression_baseline
estimator estimate ci_95_lower ci_95_upper description
0 known_true_ate 1.650000 NaN NaN Known only because this is a teaching simulation.
1 naive_difference_in_means 3.442531 NaN NaN Unadjusted association; confounding is not blo...
2 adjusted_outcome_regression 1.675725 1.600592 1.750857 Backdoor adjustment through a linear outcome m...

The naive difference is inflated because exposed users are stronger at baseline. The adjusted regression moves much closer to the known ATE because it blocks the observed backdoor paths.

Estimate Propensity Scores

Propensity-score methods first model the probability of treatment given observed common causes. The propensity score compresses multiple confounders into one balancing score.

This cell estimates propensities with logistic regression. We clip very small or very large estimated probabilities to keep manual weighting stable for teaching.

X = estimator_df[confounder_cols]
y_treatment = estimator_df["feature_exposure"]

propensity_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1_000, random_state=RANDOM_SEED),
)
propensity_model.fit(X, y_treatment)

estimated_propensity = propensity_model.predict_proba(X)[:, 1]
clipped_propensity = np.clip(estimated_propensity, 0.02, 0.98)

scored_df = estimator_df.copy()
scored_df["estimated_propensity"] = estimated_propensity
scored_df["clipped_propensity"] = clipped_propensity

propensity_summary = pd.DataFrame(
    [
        {"metric": "propensity_model_auc", "value": roc_auc_score(y_treatment, estimated_propensity)},
        {"metric": "estimated_propensity_min", "value": estimated_propensity.min()},
        {"metric": "estimated_propensity_max", "value": estimated_propensity.max()},
        {"metric": "clipped_propensity_min", "value": clipped_propensity.min()},
        {"metric": "clipped_propensity_max", "value": clipped_propensity.max()},
    ]
)

propensity_summary.to_csv(TABLE_DIR / "04_propensity_model_summary.csv", index=False)
propensity_summary
metric value
0 propensity_model_auc 0.765052
1 estimated_propensity_min 0.019625
2 estimated_propensity_max 0.984895
3 clipped_propensity_min 0.020000
4 clipped_propensity_max 0.980000

A high AUC means the observed covariates are predictive of treatment. That is not automatically good or bad; it tells us treatment was not random. Very extreme propensities would be a warning for matching and weighting.
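A useful mental baseline for that AUC number: a propensity model fit to a randomly assigned treatment should score near 0.5. A small standalone sketch with simulated data (not the notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))              # covariates
t_random = rng.binomial(1, 0.5, size=n)  # treatment independent of X

# With random assignment, covariates carry no signal about treatment,
# so in-sample AUC should sit close to 0.5 (chance-level discrimination).
model = LogisticRegression(max_iter=1_000).fit(X, t_random)
auc_random = roc_auc_score(t_random, model.predict_proba(X)[:, 1])
print(f"AUC under random assignment: {auc_random:.3f}")
```

An AUC well above this baseline, like the 0.77 printed above, is the signature of systematic, covariate-driven assignment.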

Propensity Overlap

This plot shows whether treated and untreated users occupy similar regions of the estimated propensity-score distribution. Strong overlap makes propensity methods more credible; weak overlap means some units have poor comparisons.

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(
    data=scored_df,
    x="estimated_propensity",
    hue="feature_exposure",
    bins=50,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    ax=ax,
)
ax.set_title("Estimated Propensity Score Overlap")
ax.set_xlabel("Estimated propensity score")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The distributions overlap enough for this teaching example, but the tails are worth noticing. The next notebook will go deeper on overlap, common support, and extreme weights.
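One simple overlap diagnostic is the common-support rule: keep only units whose estimated propensity lies between the larger of the two groups' minima and the smaller of their maxima. A hedged sketch on simulated propensities (the variable names are illustrative, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4_000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-1.2 * x))  # simulated propensity scores
t = rng.binomial(1, ps)

ps_treated = ps[t == 1]
ps_control = ps[t == 0]

# Common support: the propensity range covered by both groups.
lower = max(ps_treated.min(), ps_control.min())
upper = min(ps_treated.max(), ps_control.max())

inside = (ps >= lower) & (ps <= upper)
print(f"Common support: [{lower:.3f}, {upper:.3f}]")
print(f"Fraction of units retained: {inside.mean():.3f}")
```

Units outside the common-support interval have no comparable counterpart in the other group; trimming them changes the population the estimate describes, which is part of what the next notebook explores.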

Manual Matching Estimator

Matching creates treated-control comparisons among units with similar propensity scores. Here we use one-nearest-neighbor matching on the estimated propensity score.

To approximate an ATE rather than only an ATT, we match treated units to controls and controls to treated units, then average the implied individual contrasts.

treated_df = scored_df.loc[scored_df["feature_exposure"] == 1].copy()
control_df = scored_df.loc[scored_df["feature_exposure"] == 0].copy()

treated_ps = treated_df[["estimated_propensity"]].to_numpy()
control_ps = control_df[["estimated_propensity"]].to_numpy()

control_matcher = NearestNeighbors(n_neighbors=1)
control_matcher.fit(control_ps)
treated_to_control_distance, treated_to_control_index = control_matcher.kneighbors(treated_ps)
matched_control_outcome = control_df["weekly_value"].to_numpy()[treated_to_control_index[:, 0]]
att_estimate = (treated_df["weekly_value"].to_numpy() - matched_control_outcome).mean()

treated_matcher = NearestNeighbors(n_neighbors=1)
treated_matcher.fit(treated_ps)
control_to_treated_distance, control_to_treated_index = treated_matcher.kneighbors(control_ps)
matched_treated_outcome = treated_df["weekly_value"].to_numpy()[control_to_treated_index[:, 0]]
atc_estimate = (matched_treated_outcome - control_df["weekly_value"].to_numpy()).mean()

manual_matching_ate = scored_df["feature_exposure"].mean() * att_estimate + (1 - scored_df["feature_exposure"].mean()) * atc_estimate

matching_summary = pd.DataFrame(
    [
        {"quantity": "ATT from treated-to-control matching", "estimate": att_estimate},
        {"quantity": "ATC from control-to-treated matching", "estimate": atc_estimate},
        {"quantity": "ATE weighted from ATT and ATC", "estimate": manual_matching_ate},
        {"quantity": "Mean treated-to-control propensity distance", "estimate": treated_to_control_distance.mean()},
        {"quantity": "Mean control-to-treated propensity distance", "estimate": control_to_treated_distance.mean()},
    ]
)

matching_summary.to_csv(TABLE_DIR / "04_manual_matching_summary.csv", index=False)
matching_summary
quantity estimate
0 ATT from treated-to-control matching 1.725263
1 ATC from control-to-treated matching 1.739918
2 ATE weighted from ATT and ATC 1.732422
3 Mean treated-to-control propensity distance 0.000656
4 Mean control-to-treated propensity distance 0.000495

The matching estimate should be much closer to the true effect than the naive difference. The match-distance rows are simple diagnostics: smaller distances mean treated and untreated units were compared at more similar propensities.
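A common refinement of nearest-neighbor matching is a caliper: discard matched pairs whose propensity distance exceeds a tolerance, often quoted as 0.2 times the standard deviation of the logit of the propensity score. A standalone sketch of that rule on illustrative simulated scores:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
ps_treated = np.clip(rng.beta(3, 2, size=600), 0.01, 0.99)
ps_control = np.clip(rng.beta(2, 3, size=900), 0.01, 0.99)

def logit(p):
    return np.log(p / (1 - p))

# Match on the logit of the propensity score, a common practical choice.
lp_treated = logit(ps_treated).reshape(-1, 1)
lp_control = logit(ps_control).reshape(-1, 1)

# Rule-of-thumb caliper: 0.2 * SD of the pooled logit propensity.
caliper = 0.2 * np.concatenate([lp_treated, lp_control]).std(ddof=1)

matcher = NearestNeighbors(n_neighbors=1).fit(lp_control)
distance, _ = matcher.kneighbors(lp_treated)
kept = distance[:, 0] <= caliper

print(f"Caliper (logit scale): {caliper:.3f}")
print(f"Matched pairs kept: {kept.mean():.1%}")
```

Dropping pairs outside the caliper trades a little sample, and therefore some precision, for better comparability of the matches that remain.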

Manual Stratification Estimator

Stratification divides the sample into propensity-score bins, compares treated and untreated users inside each bin, and averages those within-bin differences.

This is a useful bridge between matching and weighting because it makes the balancing logic very visible.

stratified_df = scored_df.copy()
stratified_df["propensity_stratum"] = pd.qcut(
    stratified_df["estimated_propensity"],
    q=10,
    duplicates="drop",
)

stratum_rows = []
for stratum, group in stratified_df.groupby("propensity_stratum", observed=True):
    treated_group = group.loc[group["feature_exposure"] == 1]
    control_group = group.loc[group["feature_exposure"] == 0]
    if len(treated_group) == 0 or len(control_group) == 0:
        continue
    effect = treated_group["weekly_value"].mean() - control_group["weekly_value"].mean()
    stratum_rows.append(
        {
            "stratum": str(stratum),
            "rows": len(group),
            "treated_rows": len(treated_group),
            "control_rows": len(control_group),
            "mean_propensity": group["estimated_propensity"].mean(),
            "within_stratum_effect": effect,
        }
    )

stratum_effects = pd.DataFrame(stratum_rows)
manual_stratification_ate = np.average(stratum_effects["within_stratum_effect"], weights=stratum_effects["rows"])

stratum_effects.to_csv(TABLE_DIR / "04_stratification_effects_by_stratum.csv", index=False)
print(f"Manual stratification ATE: {manual_stratification_ate:.4f}")
stratum_effects
Manual stratification ATE: 1.7656
stratum rows treated_rows control_rows mean_propensity within_stratum_effect
0 (0.0186, 0.197] 450 57 393 0.137476 2.065986
1 (0.197, 0.284] 450 111 339 0.241153 1.683068
2 (0.284, 0.362] 450 151 299 0.321175 1.436624
3 (0.362, 0.44] 450 179 271 0.400964 1.465293
4 (0.44, 0.513] 450 216 234 0.476063 1.782675
5 (0.513, 0.586] 450 250 200 0.548276 1.844546
6 (0.586, 0.659] 450 281 169 0.622910 1.614906
7 (0.659, 0.742] 450 305 145 0.700100 1.792125
8 (0.742, 0.828] 450 349 101 0.784780 1.706804
9 (0.828, 0.985] 450 403 47 0.882742 2.264248

Each stratum creates a local treated-versus-control comparison among units with similar estimated propensities. If a stratum has very few treated or control units, the estimate for that stratum becomes fragile.
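The bin-choice sensitivity can be checked directly by re-running the estimator with different numbers of strata. A standalone sketch on simulated data with a known effect (the notebook's own columns are not used here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 4_000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
t = rng.binomial(1, ps)
y = 2.0 * t + 1.5 * x + rng.normal(size=n)  # true effect is 2.0

df = pd.DataFrame({"t": t, "y": y, "ps": ps})

def stratified_ate(df, q):
    df = df.assign(stratum=pd.qcut(df["ps"], q=q, duplicates="drop"))
    rows = []
    for _, g in df.groupby("stratum", observed=True):
        if g["t"].nunique() < 2:
            continue  # skip strata lacking both treated and control units
        effect = g.loc[g["t"] == 1, "y"].mean() - g.loc[g["t"] == 0, "y"].mean()
        rows.append((len(g), effect))
    sizes, effects = zip(*rows)
    return np.average(effects, weights=sizes)

estimates = {q: stratified_ate(df, q) for q in (5, 10, 20)}
for q, est in estimates.items():
    print(f"{q:>2} strata: ATE estimate {est:.3f}")
```

Coarser bins leave more residual within-stratum confounding; finer bins reduce that bias but make individual strata sparser and noisier. If the estimate swings sharply with the bin count, that is a sign the stratification is fragile.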

Manual Weighting Estimator

Inverse propensity weighting gives each observed unit a weight based on how surprising its treatment status was. Treated units get weight 1 / e(X), and untreated units get weight 1 / (1 - e(X)).

This cell computes both a plain IPW ATE and a normalized, or self-normalized, IPW estimate.

treatment = scored_df["feature_exposure"].to_numpy()
outcome = scored_df["weekly_value"].to_numpy()
propensity = scored_df["clipped_propensity"].to_numpy()

ipw_weights = treatment / propensity + (1 - treatment) / (1 - propensity)
scored_df["ipw_weight"] = ipw_weights

manual_ipw_ate = np.mean(treatment * outcome / propensity - (1 - treatment) * outcome / (1 - propensity))
manual_normalized_ipw_ate = (
    np.sum(treatment * outcome / propensity) / np.sum(treatment / propensity)
    - np.sum((1 - treatment) * outcome / (1 - propensity)) / np.sum((1 - treatment) / (1 - propensity))
)

weight_summary = pd.DataFrame(
    [
        {"metric": "manual_ipw_ate", "value": manual_ipw_ate},
        {"metric": "manual_normalized_ipw_ate", "value": manual_normalized_ipw_ate},
        {"metric": "mean_ipw_weight", "value": ipw_weights.mean()},
        {"metric": "max_ipw_weight", "value": ipw_weights.max()},
        {"metric": "p99_ipw_weight", "value": np.quantile(ipw_weights, 0.99)},
    ]
)

weight_summary.to_csv(TABLE_DIR / "04_manual_weighting_summary.csv", index=False)
weight_summary
metric value
0 manual_ipw_ate 1.802791
1 manual_normalized_ipw_ate 1.753315
2 mean_ipw_weight 1.992565
3 max_ipw_weight 24.004464
4 p99_ipw_weight 7.173259

The plain IPW and normalized IPW estimates can differ because normalization reduces sensitivity to weight scale. The maximum and p99 weights are early warnings about variance; the next notebook will inspect weights more deeply.
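A quick way to probe that variance warning is to recompute the normalized IPW estimate after capping weights at a quantile, for example the 99th percentile. A standalone sketch with simulated data and a known effect (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
t = rng.binomial(1, ps)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)  # true effect is 1.0

# Standard IPW weights: 1/e(X) for treated, 1/(1-e(X)) for controls.
w = t / ps + (1 - t) / (1 - ps)

def normalized_ipw(y, t, w):
    treated = np.sum(w * t * y) / np.sum(w * t)
    control = np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
    return treated - control

untrimmed = normalized_ipw(y, t, w)
w_capped = np.minimum(w, np.quantile(w, 0.99))  # cap extreme weights
trimmed = normalized_ipw(y, t, w_capped)

print(f"Normalized IPW, raw weights:    {untrimmed:.3f}")
print(f"Normalized IPW, capped weights: {trimmed:.3f}")
```

If capping the top 1% of weights moves the estimate substantially, a handful of extreme-weight units are driving the answer, which is exactly the fragility the p99 diagnostic above is meant to flag. Capping trades that variance for a small amount of bias.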

Covariate Balance After Weighting

A weighting estimator should reduce imbalance in observed confounders. This table compares raw standardized mean differences to IPW-weighted standardized mean differences.

weighted_balance = pd.DataFrame(
    [
        {
            "covariate": column,
            "raw_smd": standardized_mean_difference(scored_df, column),
            "ipw_weighted_smd": standardized_mean_difference(scored_df, column, weights=scored_df["ipw_weight"].to_numpy()),
        }
        for column in confounder_cols
    ]
)
weighted_balance["absolute_raw_smd"] = weighted_balance["raw_smd"].abs()
weighted_balance["absolute_weighted_smd"] = weighted_balance["ipw_weighted_smd"].abs()
weighted_balance = weighted_balance.sort_values("absolute_raw_smd", ascending=False)

weighted_balance.to_csv(TABLE_DIR / "04_weighted_covariate_balance.csv", index=False)
weighted_balance
covariate raw_smd ipw_weighted_smd absolute_raw_smd absolute_weighted_smd
0 user_engagement 0.973071 0.022206 0.973071 0.022206
4 baseline_value 0.830520 0.023212 0.830520 0.023212
3 is_power_user 0.471878 0.009249 0.471878 0.009249
1 prior_sessions 0.390748 0.003591 0.390748 0.003591
2 account_age_weeks -0.038578 -0.017259 0.038578 0.017259

The weighted SMDs should be closer to zero than the raw SMDs. That is the practical balancing purpose of propensity weighting.

Visualize Balance Before And After Weighting

This plot shows whether inverse propensity weighting reduced covariate imbalance.

balance_plot_df = weighted_balance.melt(
    id_vars="covariate",
    value_vars=["raw_smd", "ipw_weighted_smd"],
    var_name="balance_type",
    value_name="smd",
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(
    data=balance_plot_df,
    x="smd",
    y="covariate",
    hue="balance_type",
    ax=ax,
)
ax.axvline(0, color="#111827", linewidth=1)
ax.axvline(0.1, color="#9ca3af", linestyle="--", linewidth=1)
ax.axvline(-0.1, color="#9ca3af", linestyle="--", linewidth=1)
ax.set_title("Covariate Balance Before And After IPW")
ax.set_xlabel("Standardized mean difference")
ax.set_ylabel("")
ax.legend(title="Balance type")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_weighted_covariate_balance.png", dpi=160, bbox_inches="tight")
plt.show()

The weighted bars should shrink toward zero. If they did not, we would revisit the propensity model, overlap, or the causal design itself.

Manual Estimator Comparison

This table gathers the manual estimators into one place and compares them to the known true ATE.

manual_estimator_comparison = pd.DataFrame(
    [
        {"estimator": "known_true_ate", "estimate": TRUE_ATE, "notes": "Simulation benchmark."},
        {"estimator": "naive_difference_in_means", "estimate": naive_effect, "notes": "No adjustment; confounded association."},
        {"estimator": "adjusted_outcome_regression", "estimate": regression_effect, "notes": "Outcome model with observed common causes."},
        {"estimator": "propensity_score_matching", "estimate": manual_matching_ate, "notes": "Nearest-neighbor matching on estimated propensity."},
        {"estimator": "propensity_score_stratification", "estimate": manual_stratification_ate, "notes": "Weighted average of within-stratum effects."},
        {"estimator": "ipw_weighting", "estimate": manual_ipw_ate, "notes": "Plain inverse propensity weighted ATE."},
        {"estimator": "normalized_ipw_weighting", "estimate": manual_normalized_ipw_ate, "notes": "Self-normalized inverse propensity weighted ATE."},
    ]
)
manual_estimator_comparison["absolute_error_vs_true_ate"] = (manual_estimator_comparison["estimate"] - TRUE_ATE).abs()

manual_estimator_comparison.to_csv(TABLE_DIR / "04_manual_estimator_comparison.csv", index=False)
manual_estimator_comparison
estimator estimate notes absolute_error_vs_true_ate
0 known_true_ate 1.650000 Simulation benchmark. 0.000000
1 naive_difference_in_means 3.442531 No adjustment; confounded association. 1.792531
2 adjusted_outcome_regression 1.675725 Outcome model with observed common causes. 0.025725
3 propensity_score_matching 1.732422 Nearest-neighbor matching on estimated propens... 0.082422
4 propensity_score_stratification 1.765627 Weighted average of within-stratum effects. 0.115627
5 ipw_weighting 1.802791 Plain inverse propensity weighted ATE. 0.152791
6 normalized_ipw_weighting 1.753315 Self-normalized inverse propensity weighted ATE. 0.103315

The adjusted estimators should be much closer to the known ATE than the naive comparison. They will not be identical because each method uses a different finite-sample approximation.
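The finite-sample point can be made concrete with the two IPW variants from the table. On the same synthetic confounded sample, the plain (Horvitz-Thompson) and self-normalized (Hajek) estimators target the same ATE but divide by different quantities, so they return different numbers. This sketch uses illustrative names and a toy data-generating process, not the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
true_ate = 1.65
x = rng.normal(size=n)                       # illustrative confounder
p = 1 / (1 + np.exp(-x))                     # true propensity
t = rng.binomial(1, p)
y = 2.0 * x + true_ate * t + rng.normal(size=n)

w1 = t / p                                   # treated-arm weights
w0 = (1 - t) / (1 - p)                       # control-arm weights

# Plain IPW: divide by n in each arm.
ht_ate = np.mean(w1 * y) - np.mean(w0 * y)
# Self-normalized IPW: divide each arm by its realized weight total.
hajek_ate = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

print(f"plain IPW:      {ht_ate:.3f}")
print(f"normalized IPW: {hajek_ate:.3f}")
print(f"true ATE:       {true_ate:.3f}")
```

Both estimates converge to the same value as the sample grows; the gap between them on any one sample is exactly the kind of finite-sample disagreement the comparison table shows.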

Plot Manual Estimator Comparison

This plot makes the estimator comparison easier to scan. The dashed vertical line marks the known ATE.

plot_manual = manual_estimator_comparison.query("estimator != 'known_true_ate'").copy()

fig, ax = plt.subplots(figsize=(11, 5.5))
sns.scatterplot(
    data=plot_manual,
    x="estimate",
    y="estimator",
    s=90,
    color="#2563eb",
    ax=ax,
)
ax.axvline(TRUE_ATE, color="#111827", linestyle="--", linewidth=1.4, label="Known true ATE")
ax.set_title("Manual Estimator Comparison")
ax.set_xlabel("Estimated effect on weekly_value")
ax.set_ylabel("")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_manual_estimator_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

Estimator disagreement is not automatically a failure. It is a diagnostic prompt: inspect overlap, model specification, matching quality, and whether all estimators are really targeting the same population estimand.

Create The DoWhy Causal Graph

Now we move from manual estimators to DoWhy estimators. The graph states that all baseline variables are common causes of treatment and outcome.

estimator_edges = [
    ("user_engagement", "feature_exposure"),
    ("user_engagement", "weekly_value"),
    ("prior_sessions", "feature_exposure"),
    ("prior_sessions", "weekly_value"),
    ("account_age_weeks", "feature_exposure"),
    ("account_age_weeks", "weekly_value"),
    ("is_power_user", "feature_exposure"),
    ("is_power_user", "weekly_value"),
    ("baseline_value", "feature_exposure"),
    ("baseline_value", "weekly_value"),
    ("feature_exposure", "weekly_value"),
]

def edges_to_dot(edges):
    lines = ["digraph {"]
    for source, target in edges:
        lines.append(f"    {source} -> {target};")
    lines.append("}")
    return "\n".join(lines)

estimator_graph = edges_to_dot(estimator_edges)
print(estimator_graph)
digraph {
    user_engagement -> feature_exposure;
    user_engagement -> weekly_value;
    prior_sessions -> feature_exposure;
    prior_sessions -> weekly_value;
    account_age_weeks -> feature_exposure;
    account_age_weeks -> weekly_value;
    is_power_user -> feature_exposure;
    is_power_user -> weekly_value;
    baseline_value -> feature_exposure;
    baseline_value -> weekly_value;
    feature_exposure -> weekly_value;
}

This is the same backdoor graph for every DoWhy estimator below. The graph and estimand stay fixed; only the statistical estimator changes.

Visualize The Estimator Graph

This graph is intentionally simpler than the previous notebook’s graph. Every baseline covariate is a common cause, and the treatment points to the outcome.

node_positions = {
    "user_engagement": (0.08, 0.88),
    "prior_sessions": (0.08, 0.68),
    "account_age_weeks": (0.08, 0.48),
    "is_power_user": (0.08, 0.28),
    "baseline_value": (0.08, 0.08),
    "feature_exposure": (0.52, 0.50),
    "weekly_value": (0.90, 0.50),
}

node_labels = {
    "user_engagement": "User\nengagement",
    "prior_sessions": "Prior\nsessions",
    "account_age_weeks": "Account age\nweeks",
    "is_power_user": "Power\nuser",
    "baseline_value": "Baseline\nvalue",
    "feature_exposure": "Feature\nexposure",
    "weekly_value": "Weekly\nvalue",
}

node_colors = {node: "#eef2ff" for node in node_positions}
node_colors["feature_exposure"] = "#e0f2fe"
node_colors["weekly_value"] = "#dcfce7"

fig, ax = plt.subplots(figsize=(12, 6.5))
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_axis_off()

for source, target in estimator_edges:
    # Curve edges into the treatment one way, covariate-to-outcome edges the
    # other way, and keep the treatment-to-outcome edge straight.
    rad = 0.04 if target == "feature_exposure" else (-0.06 if source != "feature_exposure" else 0.0)
    ax.annotate(
        "",
        xy=node_positions[target],
        xytext=node_positions[source],
        arrowprops=dict(
            arrowstyle="-|>",
            color="#334155",
            linewidth=1.25,
            mutation_scale=15,
            shrinkA=32,
            shrinkB=32,
            connectionstyle=f"arc3,rad={rad}",
        ),
        zorder=1,
    )

for node, (x, y) in node_positions.items():
    ax.text(
        x,
        y,
        node_labels[node],
        ha="center",
        va="center",
        fontsize=10,
        fontweight="bold",
        bbox=dict(boxstyle="round,pad=0.44", facecolor=node_colors[node], edgecolor="#334155", linewidth=1.1),
        zorder=2,
    )

ax.set_title("Common Backdoor Graph For Estimator Comparison", pad=18)
fig.savefig(FIGURE_DIR / "04_estimator_comparison_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The graph clarifies why all estimators should adjust for the same baseline variables. If the graph changed, the estimand could change too.

Identify The Common Estimand With DoWhy

This cell creates a DoWhy CausalModel and identifies the backdoor estimand. We inspect the common causes before estimating.

dowhy_df = estimator_df.drop(columns=["treatment_probability"]).copy()

estimator_model = CausalModel(
    data=dowhy_df,
    treatment="feature_exposure",
    outcome="weekly_value",
    graph=estimator_graph,
)

model_metadata = pd.DataFrame(
    [
        {"component": "common_causes_reported_by_DoWhy", "value": ", ".join(estimator_model.get_common_causes())},
        {"component": "instruments_reported_by_DoWhy", "value": ", ".join(estimator_model.get_instruments()) or "none detected"},
        {"component": "effect_modifiers_reported_by_DoWhy", "value": ", ".join(estimator_model.get_effect_modifiers()) or "none detected"},
    ]
)

identified_estimand = estimator_model.identify_effect(proceed_when_unidentifiable=True)

model_metadata.to_csv(TABLE_DIR / "04_dowhy_model_metadata.csv", index=False)
model_metadata
component value
0 common_causes_reported_by_DoWhy user_engagement, prior_sessions, account_age_w...
1 instruments_reported_by_DoWhy none detected
2 effect_modifiers_reported_by_DoWhy none detected

The detected common causes match the graph. That is the key checkpoint before comparing estimators.

Run DoWhy Estimators

DoWhy provides estimator names for regression, matching, stratification, and weighting. This cell runs all four against the same identified estimand.

dowhy_estimator_specs = [
    ("dowhy_linear_regression", "backdoor.linear_regression"),
    ("dowhy_propensity_score_matching", "backdoor.propensity_score_matching"),
    ("dowhy_propensity_score_stratification", "backdoor.propensity_score_stratification"),
    ("dowhy_propensity_score_weighting", "backdoor.propensity_score_weighting"),
]

dowhy_rows = []
for label, method_name in dowhy_estimator_specs:
    try:
        estimate = estimator_model.estimate_effect(identified_estimand, method_name=method_name)
        dowhy_rows.append(
            {
                "estimator": label,
                "method_name": method_name,
                "estimate": float(estimate.value),
                "absolute_error_vs_true_ate": abs(float(estimate.value) - TRUE_ATE),
                "status": "ok",
            }
        )
    except Exception as exc:
        dowhy_rows.append(
            {
                "estimator": label,
                "method_name": method_name,
                "estimate": np.nan,
                "absolute_error_vs_true_ate": np.nan,
                "status": f"failed: {exc}",
            }
        )

dowhy_estimator_comparison = pd.DataFrame(dowhy_rows)
dowhy_estimator_comparison.to_csv(TABLE_DIR / "04_dowhy_estimator_comparison.csv", index=False)
dowhy_estimator_comparison
estimator method_name estimate absolute_error_vs_true_ate status
0 dowhy_linear_regression backdoor.linear_regression 1.675725 0.025725 ok
1 dowhy_propensity_score_matching backdoor.propensity_score_matching 1.734642 0.084642 ok
2 dowhy_propensity_score_stratification backdoor.propensity_score_stratification 1.644848 0.005152 ok
3 dowhy_propensity_score_weighting backdoor.propensity_score_weighting 1.758190 0.108190 ok

The DoWhy estimates should be near the true effect and much closer than the naive association. Differences across estimators reflect finite-sample behavior and estimator-specific modeling choices.

Manual And DoWhy Estimates Side By Side

This table combines the manual and DoWhy estimates. The exact manual and DoWhy values may differ because the internal implementations are not identical, but they should tell a similar causal story.

manual_for_join = manual_estimator_comparison.assign(source="manual")[["source", "estimator", "estimate", "absolute_error_vs_true_ate"]]
dowhy_for_join = dowhy_estimator_comparison.assign(source="dowhy")[["source", "estimator", "estimate", "absolute_error_vs_true_ate"]]

all_estimator_comparison = pd.concat([manual_for_join, dowhy_for_join], ignore_index=True)
all_estimator_comparison.to_csv(TABLE_DIR / "04_all_estimator_comparison.csv", index=False)
all_estimator_comparison
source estimator estimate absolute_error_vs_true_ate
0 manual known_true_ate 1.650000 0.000000
1 manual naive_difference_in_means 3.442531 1.792531
2 manual adjusted_outcome_regression 1.675725 0.025725
3 manual propensity_score_matching 1.732422 0.082422
4 manual propensity_score_stratification 1.765627 0.115627
5 manual ipw_weighting 1.802791 0.152791
6 manual normalized_ipw_weighting 1.753315 0.103315
7 dowhy dowhy_linear_regression 1.675725 0.025725
8 dowhy dowhy_propensity_score_matching 1.734642 0.084642
9 dowhy dowhy_propensity_score_stratification 1.644848 0.005152
10 dowhy dowhy_propensity_score_weighting 1.758190 0.108190

The combined table is the main result of the notebook. A good causal workflow asks whether the estimator family changes the conclusion enough to affect the decision.

Plot All Estimator Results

This plot compares all estimates against the known true ATE. In real data the true ATE is unavailable, but this plot style is still useful with a reference estimate or sensitivity band.

plot_all = all_estimator_comparison.query("estimator != 'known_true_ate'").copy()

fig, ax = plt.subplots(figsize=(12, 7))
sns.scatterplot(
    data=plot_all,
    x="estimate",
    y="estimator",
    hue="source",
    s=90,
    ax=ax,
)
ax.axvline(TRUE_ATE, color="#111827", linestyle="--", linewidth=1.4, label="Known true ATE")
ax.set_title("Manual And DoWhy Estimators For The Same Backdoor Estimand")
ax.set_xlabel("Estimated effect on weekly_value")
ax.set_ylabel("")
ax.legend(loc="lower right")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "04_all_estimator_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

The adjusted estimators cluster near the true effect, while the naive estimate is far away. This is the central practical lesson: the causal graph and adjustment logic matter more than the surface sophistication of the estimator.

Estimator Choice Guide

This table turns the comparison into practical guidance. It is not a rigid rulebook, but it helps students choose a reasonable first estimator and know what to check next.

estimator_choice_guide = pd.DataFrame(
    [
        {
            "situation": "Need a transparent first estimate",
            "reasonable_start": "Outcome regression",
            "what_to_check": "Model specification, residual patterns, and whether treatment effect is plausibly constant.",
        },
        {
            "situation": "Want intuitive treated-control comparisons",
            "reasonable_start": "Propensity-score matching",
            "what_to_check": "Match distances, unmatched regions, and whether matching changes the target population.",
        },
        {
            "situation": "Want a simple propensity-based diagnostic table",
            "reasonable_start": "Propensity-score stratification",
            "what_to_check": "Treated/control counts and effect stability inside strata.",
        },
        {
            "situation": "Want a population-level weighted contrast",
            "reasonable_start": "Propensity-score weighting",
            "what_to_check": "Overlap, extreme weights, and weighted covariate balance.",
        },
        {
            "situation": "Estimators disagree materially",
            "reasonable_start": "Do not average them blindly",
            "what_to_check": "Graph assumptions, overlap, model misspecification, outliers, and target population differences.",
        },
    ]
)

estimator_choice_guide.to_csv(TABLE_DIR / "04_estimator_choice_guide.csv", index=False)
estimator_choice_guide
situation reasonable_start what_to_check
0 Need a transparent first estimate Outcome regression Model specification, residual patterns, and wh...
1 Want intuitive treated-control comparisons Propensity-score matching Match distances, unmatched regions, and whethe...
2 Want a simple propensity-based diagnostic table Propensity-score stratification Treated/control counts and effect stability in...
3 Want a population-level weighted contrast Propensity-score weighting Overlap, extreme weights, and weighted covaria...
4 Estimators disagree materially Do not average them blindly Graph assumptions, overlap, model misspecifica...

The right response to estimator disagreement is investigation, not automatic ensemble averaging. Estimators are diagnostic lenses on the same causal design.

Final Summary

This final table gives a compact report-ready summary of the notebook’s causal result.

final_summary = pd.DataFrame(
    [
        {"item": "Causal question", "summary": "Average total effect of feature exposure on weekly value."},
        {"item": "Identified estimand", "summary": "Backdoor-adjusted ATE using observed pre-treatment common causes."},
        {"item": "Known true ATE", "summary": f"{TRUE_ATE:.3f}"},
        {"item": "Naive estimate", "summary": f"{naive_effect:.3f}; inflated because treated users are stronger at baseline."},
        {"item": "Adjusted regression estimate", "summary": f"{regression_effect:.3f}"},
        {"item": "Manual matching estimate", "summary": f"{manual_matching_ate:.3f}"},
        {"item": "Manual stratification estimate", "summary": f"{manual_stratification_ate:.3f}"},
        {"item": "Manual normalized IPW estimate", "summary": f"{manual_normalized_ipw_ate:.3f}"},
        {"item": "Main diagnostic", "summary": "Adjusted estimators cluster near the true effect and improve sharply over the naive comparison."},
        {"item": "Main limitation", "summary": "All estimates still depend on the observed-confounding assumption and adequate overlap."},
    ]
)

final_summary.to_csv(TABLE_DIR / "04_final_estimator_summary.csv", index=False)
final_summary
item summary
0 Causal question Average total effect of feature exposure on we...
1 Identified estimand Backdoor-adjusted ATE using observed pre-treat...
2 Known true ATE 1.650
3 Naive estimate 3.443; inflated because treated users are stro...
4 Adjusted regression estimate 1.676
5 Manual matching estimate 1.732
6 Manual stratification estimate 1.766
7 Manual normalized IPW estimate 1.753
8 Main diagnostic Adjusted estimators cluster near the true effe...
9 Main limitation All estimates still depend on the observed-con...

The final summary separates the causal claim from the estimator mechanics. That is the habit to preserve: report the graph assumptions, the estimand, the estimators, and the diagnostics together.

Student Exercises

Try these after running the notebook:

  1. Increase the strength of confounding in the treatment assignment equation and watch the naive estimate move farther away.
  2. Reduce sample size and see which estimators become noisier.
  3. Change the number of propensity strata from 10 to 5 or 20.
  4. Match on all standardized covariates instead of only the propensity score.
  5. Remove clipping in the IPW estimator and inspect the maximum weight.
  6. Add a nonlinear outcome term and compare how regression and propensity estimators respond.
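As a starting point for exercise 5, here is a minimal sketch of inspecting the maximum inverse propensity weight with and without clipping. The data, the confounding strength, and the clip bounds (0.01, 0.99) are all illustrative choices, not recommendations for the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
p = 1 / (1 + np.exp(-2.5 * x))   # strong confounding produces extreme scores
t = rng.binomial(1, p)

# Raw IPW weights can explode when a unit lands in the unlikely arm.
raw_w = np.where(t == 1, 1 / p, 1 / (1 - p))

# Clipping the propensity bounds the weights at 1 / 0.01 = 100.
p_clipped = np.clip(p, 0.01, 0.99)
clipped_w = np.where(t == 1, 1 / p_clipped, 1 / (1 - p_clipped))

print(f"max raw weight:     {raw_w.max():.1f}")
print(f"max clipped weight: {clipped_w.max():.1f}")
```

Clipping trades a little bias (the clipped units are reweighted incorrectly) for a large reduction in variance, which is why inspecting the maximum weight is a standard IPW diagnostic.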

Closing Notes

This notebook showed that multiple estimators can target the same identified causal effect. Regression, matching, stratification, and weighting are different ways to operationalize the same backdoor logic.

The next tutorial will focus more deeply on weighting, overlap, common support, and why extreme propensity scores can make causal estimates fragile.