DoubleML Tutorial 07: Difference-In-Differences (DID)

This notebook teaches difference-in-differences with DoubleML. The goal is to make the causal design feel concrete before touching the package API.

Difference-in-differences, usually shortened to DID, is used when one group receives a treatment after a time cutoff and another comparable group does not. The basic idea is simple: compare how the treated group changed over time to how the control group changed over the same time period. The causal content comes from the assumption that, without treatment, the treated group would have followed the same trend as the control group after conditioning on the right pre-treatment information.

This notebook uses synthetic data so that the true effect is known. That makes the lesson useful for learning: we can see when raw DID works, when it fails, and how DoubleML uses nuisance learning to adjust for observed differences between treated and control groups.

Learning Goals

By the end of this notebook, you should be able to:

  • describe the DID estimand in potential-outcome language;
  • distinguish panel DID from repeated cross-section DID;
  • explain why parallel trends is an identification assumption, not a model fit metric;
  • build DoubleMLDIDData objects for panel and repeated cross-section settings;
  • fit DoubleMLDID and DoubleMLDIDCS with regression and propensity nuisance learners;
  • read nuisance diagnostics, propensity overlap plots, placebo checks, and stress tests;
  • write cautious DID conclusions that separate estimates from assumptions.

The DID Problem

The common two-period DID setting has four cells:

  • control group before treatment;
  • control group after treatment;
  • treated group before treatment;
  • treated group after treatment.

The treatment is not just the group label. The treated group receives the treatment only in the post period. In symbols, if D marks the treated group and T marks the post period, actual exposure is D * T.

The raw DID contrast is:

(mean outcome treated post - mean outcome treated pre)
- (mean outcome control post - mean outcome control pre)

This subtracts away time shocks that both groups share. The design is powerful because it can remove stable group differences and common time shocks. It is also fragile because it does not automatically remove group-specific trends.
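
As a quick standalone numeric illustration, with made-up cell means (hypothetical numbers, not drawn from the dataset built later in this notebook):

cell_means = {
    "treated_pre": 4.4,
    "treated_post": 6.0,
    "control_pre": 3.8,
    "control_post": 4.0,
}
raw_did = (cell_means["treated_post"] - cell_means["treated_pre"]) - (
    cell_means["control_post"] - cell_means["control_pre"]
)
print(raw_did)  # (6.0 - 4.4) - (4.0 - 3.8) = 1.6 - 0.2 = 1.4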

What DoubleML Adds

DID can be combined with flexible nuisance models. For panel data with two periods, DoubleML works with the outcome change:

Delta Y = Y_post - Y_pre

The important nuisance functions are:

  • g0(0, X): the expected untreated outcome change among controls, conditional on covariates;
  • g0(1, X): the expected outcome change among treated units, conditional on covariates;
  • m0(X): the probability that a unit belongs to the treated group.

The propensity model m0(X) matters when treatment-group membership is observational rather than randomized. DoubleML uses cross-fitting so the nuisance predictions for each observation are produced by models that did not train on that observation. The score is designed to be Neyman-orthogonal: its first-order sensitivity to small nuisance-model errors vanishes, so those errors enter the final treatment estimate only at second order.
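
In formula form, the estimator behind the observational score can be sketched as follows (a sketch that matches the manual implementation later in this notebook; the package adds normalization and inference details):

theta_hat = mean( (D / p_D - w(X)) * (Delta Y - g0_hat(0, X)) )

where p_D is the treated share and w(X) = (1 - D) * m_hat(X) / (1 - m_hat(X)), rescaled so that it averages to one over the sample. Treated units contribute their residual outcome change with weight 1 / p_D; control units contribute through odds weights that make the control group resemble the treated group on covariates.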

Panel DID Versus Repeated Cross Sections

This notebook covers two DID data structures.

Panel DID observes the same units before and after treatment. The natural outcome for DoubleML is the within-unit change, Y_post - Y_pre. This is efficient because stable unit-level differences are differenced out.

Repeated cross-section DID observes different samples before and after treatment. There is no unit-level change. Instead, the model compares outcome levels across the four group-time cells while learning conditional outcome functions for each cell.

DoubleML has separate estimators for these cases:

  • DoubleMLDID for two-period panel DID;
  • DoubleMLDIDCS for two-period repeated cross-section DID.

The installed package version used here exposes these class names. Some future versions may expose renamed binary-treatment DID classes; the workflow and assumptions carry over unchanged.

Runtime Note

This notebook fits several DoubleML models, including repeated sample splits and small stress tests. On a typical laptop, a full run should take about 2 to 4 minutes. The synthetic samples are large enough to make the diagnostics stable but small enough to keep the tutorial interactive.

Setup

The setup cell imports the scientific Python stack, configures output folders, suppresses package deprecation notices that would otherwise distract from the lesson, and records package versions. The warning filters are intentionally narrow: they keep tutorial output readable while still allowing ordinary runtime errors to surface.

from pathlib import Path
import os
import sys
import warnings

PROJECT_ROOT = Path.cwd().resolve()
for candidate in [PROJECT_ROOT, *PROJECT_ROOT.parents]:
    if (candidate / "pyproject.toml").exists():
        PROJECT_ROOT = candidate
        break

OUTPUT_DIR = PROJECT_ROOT / "notebooks" / "tutorials" / "doubleml" / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
REPORT_DIR = OUTPUT_DIR / "reports"
MATPLOTLIB_CACHE_DIR = OUTPUT_DIR / "matplotlib_cache"

for directory in [DATASET_DIR, FIGURE_DIR, TABLE_DIR, REPORT_DIR, MATPLOTLIB_CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

os.environ.setdefault("MPLCONFIGDIR", str(MATPLOTLIB_CACHE_DIR))
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="IProgress not found.*", category=Warning)
warnings.filterwarnings("ignore", message="DoubleMLDID.*deprecated.*", category=DeprecationWarning)
warnings.filterwarnings("ignore", message="DoubleMLDIDCS.*deprecated.*", category=DeprecationWarning)
warnings.filterwarnings("ignore", message="DoubleMLDIDData.*deprecated.*", category=DeprecationWarning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

import sklearn
from sklearn.base import clone
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import statsmodels.formula.api as smf

import doubleml as dml
from doubleml import DoubleMLDID, DoubleMLDIDCS, DoubleMLDIDData

NOTEBOOK_PREFIX = "07"
RANDOM_SEED = 42
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams.update({"figure.dpi": 120, "savefig.dpi": 160})

package_versions = pd.DataFrame(
    [
        {"package": "python", "version": os.sys.version.split()[0]},
        {"package": "doubleml", "version": dml.__version__},
        {"package": "numpy", "version": np.__version__},
        {"package": "pandas", "version": pd.__version__},
        {"package": "scikit-learn", "version": sklearn.__version__},
        {"package": "matplotlib", "version": plt.matplotlib.__version__},
        {"package": "seaborn", "version": sns.__version__},
    ]
)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 python 3.13.12
1 doubleml 0.11.2
2 numpy 2.4.4
3 pandas 3.0.2
4 scikit-learn 1.6.1
5 matplotlib 3.10.9
6 seaborn 0.13.2

The version table is part of the reproducibility record. DID behavior can depend on package version because estimator class names and optional APIs evolve over time. The notebook uses the classes available in the installed environment and keeps the causal assumptions explicit so the tutorial remains conceptually portable.

Helper Functions

The helper functions below keep the notebook focused on causal ideas. They save tables consistently, build reusable learners, summarize DID contrasts, extract DoubleML predictions, and compute simple diagnostics.

The comments are intentionally specific. DID notebooks can become confusing quickly because there are several related quantities: group labels, post-period labels, actual exposure, raw DID contrasts, covariate-adjusted estimates, and DoubleML estimates.

def save_table(df, file_name, index=False):
    """Save a table under the notebook output folder and return the DataFrame for display."""
    path = TABLE_DIR / file_name
    df.to_csv(path, index=index)
    return df


def sigmoid(x):
    """Numerically stable enough logistic transform for the synthetic designs in this notebook."""
    return 1.0 / (1.0 + np.exp(-x))


def rmse(y_true, y_pred):
    """Root mean squared error with arrays coerced to one dimension."""
    return float(np.sqrt(mean_squared_error(np.asarray(y_true).ravel(), np.asarray(y_pred).ravel())))


def clip_probability(values, lower=0.01, upper=0.99):
    """Clip propensity-style predictions away from 0 and 1 to avoid unstable inverse weights."""
    return np.clip(np.asarray(values, dtype=float), lower, upper)


def predict_probability(model, X):
    """Return P(class=1) from a classifier that implements predict_proba."""
    return model.predict_proba(X)[:, 1]


def make_linear_learners():
    """Low-variance nuisance learners that match the mostly linear synthetic design."""
    ml_g = make_pipeline(StandardScaler(), LinearRegression())
    ml_m = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
    return ml_g, ml_m


def make_hgb_learners(seed=RANDOM_SEED, max_iter=90):
    """Tree-based nuisance learners for flexible, nonlinear adjustment."""
    ml_g = HistGradientBoostingRegressor(
        max_iter=max_iter,
        max_leaf_nodes=15,
        min_samples_leaf=30,
        learning_rate=0.06,
        random_state=seed,
    )
    ml_m = HistGradientBoostingClassifier(
        max_iter=max_iter,
        max_leaf_nodes=15,
        min_samples_leaf=30,
        learning_rate=0.06,
        random_state=seed,
    )
    return ml_g, ml_m


def raw_panel_did(df, y_col="delta_weekly_value", d_col="treated"):
    """For panel data, raw DID is the treated-control difference in outcome changes."""
    grouped = df.groupby(d_col)[y_col].agg(["mean", "std", "size"]).reset_index()
    treated_mean = grouped.loc[grouped[d_col] == 1, "mean"].iloc[0]
    control_mean = grouped.loc[grouped[d_col] == 0, "mean"].iloc[0]
    return float(treated_mean - control_mean), grouped


def raw_repeated_cs_did(df, y_col="weekly_value", d_col="treated", t_col="post"):
    """For repeated cross sections, raw DID uses the four group-time cell means."""
    cell_means = df.groupby([d_col, t_col])[y_col].mean().unstack(t_col)
    estimate = (cell_means.loc[1, 1] - cell_means.loc[1, 0]) - (cell_means.loc[0, 1] - cell_means.loc[0, 0])
    return float(estimate), cell_means


def summarize_doubleml(model, label, design, true_target):
    """Create a compact one-row summary from a fitted DoubleML object."""
    ci = model.confint(level=0.95).iloc[0]
    theta = float(model.coef[0])
    se = float(model.se[0])
    return {
        "estimator": label,
        "design": design,
        "theta_hat": theta,
        "std_error": se,
        "ci_95_lower": float(ci.iloc[0]),
        "ci_95_upper": float(ci.iloc[1]),
        "true_target": float(true_target),
        "bias_vs_target": theta - float(true_target),
    }


def prediction_vector(model, key):
    """Extract a DoubleML nuisance prediction array as a flat vector."""
    return np.asarray(model.predictions[key]).reshape(-1)


def learner_loss_table(model, model_label):
    """Convert DoubleML learner-loss output into a tidy table."""
    losses = model.evaluate_learners()
    rows = []
    for learner_name, values in losses.items():
        rows.append(
            {
                "model": model_label,
                "learner": learner_name,
                "metric_value": float(np.asarray(values).mean()),
                "metric": "RMSE for outcome learners; classification error style score for propensity learner",
            }
        )
    return pd.DataFrame(rows)


def add_ci_columns(df):
    """Prepare symmetric error-bar columns for estimate plots."""
    out = df.copy()
    out["lower_error"] = out["theta_hat"] - out["ci_95_lower"]
    out["upper_error"] = out["ci_95_upper"] - out["theta_hat"]
    return out

These helpers do not hide any causal logic. They simply keep repeated output formatting from crowding the tutorial. The core DID calculations still appear directly in later cells.

Teaching Diagram

Before generating data, it helps to draw the design. The arrows below separate three ideas:

  • pre-treatment covariates influence whether a unit belongs to the treated group;
  • time affects outcomes for everyone;
  • the actual treatment occurs only for treated-group units in the post period.

The dashed path marks the danger zone: treated and control groups can have different untreated trends. DID is credible only when those trend differences are absent or can be made comparable after conditioning on observed covariates.

from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

nodes = {
    "X": {"xy": (0.10, 0.52), "label": "Pre-treatment\ncontrols X", "color": "#dbeafe"},
    "D": {"xy": (0.35, 0.76), "label": "Treated\ngroup D", "color": "#dcfce7"},
    "T": {"xy": (0.35, 0.28), "label": "Post\nperiod T", "color": "#fef3c7"},
    "A": {"xy": (0.62, 0.52), "label": "Actual exposure\nD x T", "color": "#fde68a"},
    "Y": {"xy": (0.88, 0.52), "label": "Outcome\nY", "color": "#fee2e2"},
    "R": {"xy": (0.62, 0.86), "label": "Untreated trend\nrisk", "color": "#f3f4f6"},
}

fig, ax = plt.subplots(figsize=(12, 6))
ax.set_axis_off()
box_w, box_h = 0.15, 0.11


def anchor(node, side):
    """Return a border point on a node box so arrows do not start from hidden centers."""
    x, y = nodes[node]["xy"]
    offsets = {
        "left": (-box_w / 2, 0),
        "right": (box_w / 2, 0),
        "top": (0, box_h / 2),
        "bottom": (0, -box_h / 2),
        "upper_left": (-box_w / 2, box_h * 0.25),
        "upper_right": (box_w / 2, box_h * 0.25),
        "lower_left": (-box_w / 2, -box_h * 0.25),
        "lower_right": (box_w / 2, -box_h * 0.25),
    }
    dx, dy = offsets[side]
    return (x + dx, y + dy)


def draw_arrow(start, end, color, style="solid", rad=0.0, linewidth=1.7):
    """Draw one visible arrow between already-chosen border points."""
    arrow = FancyArrowPatch(
        start,
        end,
        arrowstyle="-|>",
        mutation_scale=18,
        linewidth=linewidth,
        color=color,
        linestyle=style,
        shrinkA=0,
        shrinkB=0,
        connectionstyle=f"arc3,rad={rad}",
        zorder=5,
    )
    ax.add_patch(arrow)


def draw_routed_arrow(points, color, style="solid", linewidth=1.7):
    """Draw a routed arrow where early segments are lines and only the final segment has an arrowhead."""
    for start, end in zip(points[:-2], points[1:-1]):
        ax.plot(
            [start[0], end[0]],
            [start[1], end[1]],
            color=color,
            linestyle=style,
            linewidth=linewidth,
            zorder=5,
        )
    draw_arrow(points[-2], points[-1], color=color, style=style, linewidth=linewidth)

# Direct arrows that can be drawn clearly between adjacent boxes.
draw_arrow(anchor("X", "upper_right"), anchor("D", "left"), color="#334155", rad=0.08)
draw_arrow(anchor("D", "lower_right"), anchor("A", "upper_left"), color="#15803d")
draw_arrow(anchor("T", "upper_right"), anchor("A", "lower_left"), color="#b45309")
draw_arrow(anchor("A", "right"), anchor("Y", "left"), color="#b45309")
draw_arrow(anchor("R", "lower_right"), anchor("Y", "upper_left"), color="#6b7280", style="dashed", rad=-0.08, linewidth=1.5)

# Routed arrows avoid passing behind the D, A, and Y boxes.
draw_routed_arrow([anchor("X", "top"), (0.24, 0.91), anchor("R", "left")], color="#6b7280", style="dashed", linewidth=1.5)
draw_routed_arrow([anchor("X", "lower_right"), (0.50, 0.41), anchor("Y", "lower_left")], color="#334155")
draw_routed_arrow([anchor("T", "lower_right"), (0.62, 0.21), anchor("Y", "bottom")], color="#334155")

for spec in nodes.values():
    x, y = spec["xy"]
    rect = FancyBboxPatch(
        (x - box_w / 2, y - box_h / 2),
        box_w,
        box_h,
        boxstyle="round,pad=0.018",
        facecolor=spec["color"],
        edgecolor="#334155",
        linewidth=1.2,
        zorder=3,
    )
    ax.add_patch(rect)
    ax.text(x, y, spec["label"], ha="center", va="center", fontsize=11, fontweight="bold", zorder=4)

ax.text(
    0.50,
    0.08,
    "DID compares before-after changes. DoubleML helps adjust for X, but the trend assumption remains a design assumption.",
    ha="center",
    va="center",
    fontsize=10,
    color="#475569",
)
ax.set_title("Difference-In-Differences Teaching Design", pad=18)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_did_design_dag.png", dpi=160, bbox_inches="tight")
plt.show()

The diagram makes the exposure structure explicit. A treated-group label alone is not the treatment. A post-period label alone is not the treatment. The treatment is the interaction of the two.

Synthetic Panel Data

The first dataset observes the same units before and after treatment. Treatment-group membership is intentionally observational: high-engagement and high-activity units are more likely to be in the treated group. Those same covariates also affect untreated outcome trends.

That creates a useful teaching setting. The raw DID contrast is biased because treated and control groups have different covariate mixes. Conditional parallel trends holds in the synthetic design because the untreated trend depends only on observed covariates, not on unobserved group-specific shocks.

rng = np.random.default_rng(RANDOM_SEED)
N_PANEL = 8_000

engagement_score = rng.normal(0, 1, N_PANEL)
recent_activity = rng.normal(0, 1, N_PANEL)
content_fit = rng.normal(0, 1, N_PANEL)
price_sensitivity = rng.normal(0, 1, N_PANEL)
tenure_signal = rng.uniform(0, 1, N_PANEL)
mobile_user = rng.binomial(1, 0.55, N_PANEL)

panel_x = pd.DataFrame(
    {
        "engagement_score": engagement_score,
        "recent_activity": recent_activity,
        "content_fit": content_fit,
        "price_sensitivity": price_sensitivity,
        "tenure_signal": tenure_signal,
        "mobile_user": mobile_user,
    }
)
feature_cols = list(panel_x.columns)

# Treatment-group membership is observational and depends on pre-treatment covariates.
true_propensity = sigmoid(
    -0.10
    + 0.50 * engagement_score
    + 0.35 * recent_activity
    - 0.35 * price_sensitivity
    + 0.15 * mobile_user
)
treated = rng.binomial(1, true_propensity)

# Baseline levels differ across users, but the panel DID model works with changes.
baseline_level = (
    4.0
    + 0.80 * engagement_score
    + 0.50 * recent_activity
    + 0.70 * content_fit
    - 0.40 * price_sensitivity
    + 0.20 * mobile_user
    + rng.normal(0, 0.60, N_PANEL)
)

# Untreated trends vary with observed covariates. This makes unconditional DID imperfect.
expected_untreated_trend = (
    0.25
    + 0.25 * engagement_score
    + 0.18 * recent_activity
    + 0.12 * content_fit
    - 0.08 * price_sensitivity
    + 0.05 * mobile_user
)
untreated_trend = expected_untreated_trend + rng.normal(0, 0.10, N_PANEL)

# Treatment effects are heterogeneous. The target for DID is the treated-group average post effect.
true_tau = 1.00 + 0.35 * engagement_score + 0.25 * content_fit - 0.15 * price_sensitivity + 0.10 * mobile_user

y_pre = baseline_level + rng.normal(0, 0.45, N_PANEL)
y_post = baseline_level + untreated_trend + true_tau * treated + rng.normal(0, 0.45, N_PANEL)
delta_weekly_value = y_post - y_pre

panel_df = pd.concat(
    [
        pd.DataFrame({"unit_id": np.arange(N_PANEL)}),
        panel_x,
        pd.DataFrame(
            {
                "treated": treated,
                "y_pre": y_pre,
                "y_post": y_post,
                "delta_weekly_value": delta_weekly_value,
                "expected_untreated_trend": expected_untreated_trend,
                "true_untreated_trend": untreated_trend,
                "true_tau": true_tau,
                "true_propensity": true_propensity,
            }
        ),
    ],
    axis=1,
)

TRUE_PANEL_ATT = panel_df.loc[panel_df["treated"] == 1, "true_tau"].mean()
TRUE_PANEL_ATE = panel_df["true_tau"].mean()

panel_path = DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_panel_did.csv"
panel_df.to_csv(panel_path, index=False)

print(f"Panel rows: {len(panel_df):,}")
print(f"True panel ATT: {TRUE_PANEL_ATT:.4f}")
print(f"True panel ATE: {TRUE_PANEL_ATE:.4f}")
print(f"Saved panel data to: {panel_path.relative_to(PROJECT_ROOT)}")
display(panel_df.head())
Panel rows: 8,000
True panel ATT: 1.1606
True panel ATE: 1.0594
Saved panel data to: notebooks/tutorials/doubleml/outputs/datasets/07_synthetic_panel_did.csv
unit_id engagement_score recent_activity content_fit price_sensitivity tenure_signal mobile_user treated y_pre y_post delta_weekly_value expected_untreated_trend true_untreated_trend true_tau true_propensity
0 0 0.304717 0.333602 -1.322489 -0.422500 0.572782 1 0 3.907776 5.766276 1.858501 0.311329 0.374285 0.939404 0.614672
1 1 -1.039984 1.236343 1.389147 0.969549 0.251107 1 0 5.808336 6.529171 0.720835 0.351679 0.202828 0.937860 0.406944
2 2 0.750451 1.215829 -0.348241 -0.432650 0.371131 0 1 4.734420 6.621476 1.887057 0.649285 0.752185 1.240495 0.701027
3 3 0.940565 -0.450957 -0.071824 0.354970 0.233027 0 0 4.650214 5.758102 1.107888 0.366952 0.434669 1.257996 0.522038
4 4 -1.951035 0.194017 0.283535 -0.125404 0.236977 0 1 3.023087 2.352221 -0.670866 -0.158779 -0.199061 0.406832 0.276134

The true ATT is larger than the true ATE because higher-effect units are more likely to be in the treated group. This is not a bug. DID with a treated-group policy usually targets the effect for the treated group, not necessarily the population-wide average effect.

Panel Field Dictionary

A field dictionary turns the synthetic dataset into a readable teaching object. It also mirrors good applied workflow: before fitting any causal model, make the unit, treatment, time, outcome, and covariate roles explicit.

panel_field_dictionary = pd.DataFrame(
    [
        {"field": "unit_id", "role": "identifier", "description": "Unique unit observed before and after treatment."},
        {"field": "treated", "role": "group indicator", "description": "Equals 1 for units in the treated group; treatment exposure happens only after the policy starts."},
        {"field": "y_pre", "role": "pre outcome", "description": "Outcome measured before treatment exposure."},
        {"field": "y_post", "role": "post outcome", "description": "Outcome measured after treated-group units receive exposure."},
        {"field": "delta_weekly_value", "role": "DID outcome", "description": "Within-unit change, y_post minus y_pre, used by panel DID."},
        {"field": "engagement_score", "role": "covariate", "description": "Pre-treatment engagement signal that affects group membership and trends."},
        {"field": "recent_activity", "role": "covariate", "description": "Pre-treatment activity signal."},
        {"field": "content_fit", "role": "covariate", "description": "Pre-treatment fit signal related to outcomes and treatment effects."},
        {"field": "price_sensitivity", "role": "covariate", "description": "Pre-treatment risk or sensitivity signal."},
        {"field": "tenure_signal", "role": "covariate", "description": "Pre-treatment tenure-like signal."},
        {"field": "mobile_user", "role": "covariate", "description": "Binary pre-treatment segment marker."},
        {"field": "expected_untreated_trend", "role": "simulation truth", "description": "Expected no-treatment trend given observed covariates."},
        {"field": "true_untreated_trend", "role": "simulation truth", "description": "Realized no-treatment trend including noise."},
        {"field": "true_tau", "role": "simulation truth", "description": "Individual treatment effect in the post period."},
        {"field": "true_propensity", "role": "simulation truth", "description": "True probability of treated-group membership."},
    ]
)
save_table(panel_field_dictionary, f"{NOTEBOOK_PREFIX}_panel_field_dictionary.csv")
display(panel_field_dictionary)
field role description
0 unit_id identifier Unique unit observed before and after treatment.
1 treated group indicator Equals 1 for units in the treated group; treat...
2 y_pre pre outcome Outcome measured before treatment exposure.
3 y_post post outcome Outcome measured after treated-group units rec...
4 delta_weekly_value DID outcome Within-unit change, y_post minus y_pre, used b...
5 engagement_score covariate Pre-treatment engagement signal that affects g...
6 recent_activity covariate Pre-treatment activity signal.
7 content_fit covariate Pre-treatment fit signal related to outcomes a...
8 price_sensitivity covariate Pre-treatment risk or sensitivity signal.
9 tenure_signal covariate Pre-treatment tenure-like signal.
10 mobile_user covariate Binary pre-treatment segment marker.
11 expected_untreated_trend simulation truth Expected no-treatment trend given observed cov...
12 true_untreated_trend simulation truth Realized no-treatment trend including noise.
13 true_tau simulation truth Individual treatment effect in the post period.
14 true_propensity simulation truth True probability of treated-group membership.

The key modeling outcome for panel DID is delta_weekly_value. The pre and post outcome columns remain useful for plotting, raw checks, and sanity checks, but the DoubleML panel DID estimator consumes the change directly.
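
A one-line sanity check makes this concrete (a quick assertion, not part of the estimator):

# The DID outcome must equal the within-unit change exactly.
assert np.allclose(panel_df["delta_weekly_value"], panel_df["y_post"] - panel_df["y_pre"])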

Basic Panel Data Audit

A DID audit should check sample size, missingness, treatment balance, and outcome ranges before any estimator is fitted. The code below keeps the checks simple and transparent.

panel_audit = pd.DataFrame(
    {
        "metric": [
            "rows",
            "columns",
            "missing_cells",
            "treated_share",
            "mean_pre_outcome",
            "mean_post_outcome",
            "mean_outcome_change",
        ],
        "value": [
            len(panel_df),
            panel_df.shape[1],
            int(panel_df.isna().sum().sum()),
            panel_df["treated"].mean(),
            panel_df["y_pre"].mean(),
            panel_df["y_post"].mean(),
            panel_df["delta_weekly_value"].mean(),
        ],
    }
)
save_table(panel_audit, f"{NOTEBOOK_PREFIX}_panel_data_audit.csv")
display(panel_audit)
metric value
0 rows 8000.000000
1 columns 15.000000
2 missing_cells 0.000000
3 treated_share 0.491625
4 mean_pre_outcome 4.116102
5 mean_post_outcome 4.968973
6 mean_outcome_change 0.852871

The treated share is close to one half, so the tutorial has enough observations in both groups. In applied DID, a tiny treated group or tiny control group would make weighting, overlap, and standard errors much more fragile.

Target Summary

Because this is synthetic data, we can compute the true targets. The ATT is the main DID target here because the estimand asks: among units that actually entered the treated group, what was the average post-period treatment effect?

panel_target_summary = pd.DataFrame(
    [
        {"target": "panel ATT", "value": TRUE_PANEL_ATT, "meaning": "Average true treatment effect among treated-group units."},
        {"target": "panel ATE", "value": TRUE_PANEL_ATE, "meaning": "Average true treatment effect over all units."},
        {"target": "treated share", "value": panel_df["treated"].mean(), "meaning": "Share of units in the treated group."},
        {"target": "mean true propensity", "value": panel_df["true_propensity"].mean(), "meaning": "Average probability of treated-group membership."},
    ]
)
save_table(panel_target_summary, f"{NOTEBOOK_PREFIX}_panel_target_summary.csv")
display(panel_target_summary)
target value meaning
0 panel ATT 1.160608 Average true treatment effect among treated-gr...
1 panel ATE 1.059387 Average true treatment effect over all units.
2 treated share 0.491625 Share of units in the treated group.
3 mean true propensity 0.493144 Average probability of treated-group membership.

The ATT and ATE differ because treatment effects vary and treated-group membership is selective. A strong report should name the target clearly instead of saying only that DID estimated “the effect.”
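
Because the simulation truth is available, the selection-on-effects story can be checked directly (a synthetic-data check with no real-data analogue):

# Units with larger individual effects are more likely to be treated,
# which is exactly why the ATT exceeds the ATE here.
print(panel_df["true_tau"].corr(panel_df["true_propensity"]))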

Covariate Balance By Treatment Group

DID is not a randomized experiment by default. The table below compares treated and control groups on pre-treatment covariates. Large differences do not automatically invalidate DID, but they make unconditional parallel trends less plausible.

balance_rows = []
for col in feature_cols:
    treated_mean = panel_df.loc[panel_df["treated"] == 1, col].mean()
    control_mean = panel_df.loc[panel_df["treated"] == 0, col].mean()
    pooled_sd = panel_df[col].std(ddof=1)
    balance_rows.append(
        {
            "feature": col,
            "treated_mean": treated_mean,
            "control_mean": control_mean,
            "difference": treated_mean - control_mean,
            "standardized_difference": (treated_mean - control_mean) / pooled_sd,
        }
    )

panel_balance = pd.DataFrame(balance_rows).sort_values("standardized_difference", key=np.abs, ascending=False)
save_table(panel_balance, f"{NOTEBOOK_PREFIX}_panel_covariate_balance.csv")
display(panel_balance)
feature treated_mean control_mean difference standardized_difference
0 engagement_score 0.223692 -0.235643 0.459335 0.458236
1 recent_activity 0.159149 -0.170904 0.330054 0.328654
3 price_sensitivity -0.141904 0.163144 -0.305048 -0.302151
5 mobile_user 0.560386 0.537251 0.023135 0.046488
2 content_fit 0.019964 0.058889 -0.038925 -0.039005
4 tenure_signal 0.502406 0.502485 -0.000079 -0.000272

The covariate differences are deliberate. They create the reason to use an observational DID score with nuisance adjustment rather than relying entirely on the raw two-by-two comparison.

Treatment Propensity And Overlap

The propensity here is the probability of belonging to the treated group. DID needs enough overlap: for similar covariate profiles, we should see both treated and control units. When overlap is weak, the estimator has to extrapolate or assign large weights.
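
A simple tail check quantifies overlap before plotting; the 0.05 and 0.95 thresholds below are illustrative choices, not package defaults.

extreme_share = (
    (panel_df["true_propensity"] < 0.05) | (panel_df["true_propensity"] > 0.95)
).mean()
print(f"Share of units with extreme true propensity: {extreme_share:.3f}")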

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.histplot(
    data=panel_df,
    x="true_propensity",
    hue="treated",
    bins=40,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[0],
)
axes[0].set_title("True Treatment-Group Propensity")
axes[0].set_xlabel("P(D=1 | X)")
axes[0].set_ylabel("Density")

sns.scatterplot(
    data=panel_df.sample(1_500, random_state=RANDOM_SEED),
    x="engagement_score",
    y="recent_activity",
    hue="treated",
    alpha=0.55,
    s=24,
    ax=axes[1],
)
axes[1].set_title("Observed Treated And Control Support")
axes[1].set_xlabel("Engagement Score")
axes[1].set_ylabel("Recent Activity")

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_panel_true_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The overlap is not perfect, but both groups appear across most of the support. This is a reasonable teaching case: adjustment matters, but the model is not forced to estimate effects in completely unsupported regions.

Raw Panel DID

For panel data, raw DID is very simple: compute each unit’s before-after change and compare the average change between treated and control groups. This removes stable unit-level differences but does not adjust for covariate-related trends.

raw_panel_estimate, panel_change_by_group = raw_panel_did(panel_df)

raw_panel_summary = pd.DataFrame(
    [
        {
            "estimator": "Raw panel DID",
            "design": "panel",
            "theta_hat": raw_panel_estimate,
            "true_target": TRUE_PANEL_ATT,
            "bias_vs_target": raw_panel_estimate - TRUE_PANEL_ATT,
        }
    ]
)
save_table(panel_change_by_group, f"{NOTEBOOK_PREFIX}_panel_change_by_group.csv")
save_table(raw_panel_summary, f"{NOTEBOOK_PREFIX}_raw_panel_did_summary.csv")

display(panel_change_by_group)
display(raw_panel_summary)
treated mean std size
0 0 0.182551 0.727967 4067
1 1 1.546029 0.967813 3933
estimator design theta_hat true_target bias_vs_target
0 Raw panel DID panel 1.363479 1.160608 0.202871

The raw DID estimate is above the true ATT. The treated group has covariates associated with stronger no-treatment trends, so a simple before-after comparison attributes too much of the natural trend difference to the treatment.
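
With synthetic data, this bias can be decomposed directly: the raw DID equals the true ATT plus the treated-control gap in untreated trends, up to noise. The check below uses simulation-truth columns, so it has no real-data analogue.

trend_gap = (
    panel_df.loc[panel_df["treated"] == 1, "true_untreated_trend"].mean()
    - panel_df.loc[panel_df["treated"] == 0, "true_untreated_trend"].mean()
)
print(f"Untreated trend gap:         {trend_gap:.4f}")
print(f"True ATT plus trend gap:     {TRUE_PANEL_ATT + trend_gap:.4f}")
print(f"Raw panel DID for reference: {raw_panel_estimate:.4f}")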

Visualizing The Four Panel Means

Even though panel DID uses outcome changes, the classic four-cell plot is still useful. It shows the pre and post outcome levels for each group and makes the DID contrast easier to explain.

panel_long = panel_df.melt(
    id_vars=["unit_id", "treated"],
    value_vars=["y_pre", "y_post"],
    var_name="period",
    value_name="weekly_value",
)
panel_long["period"] = panel_long["period"].map({"y_pre": "pre", "y_post": "post"})

panel_means = (
    panel_long.groupby(["treated", "period"])["weekly_value"]
    .agg(mean="mean", se=lambda x: x.std(ddof=1) / np.sqrt(len(x)), n="size")
    .reset_index()
)
save_table(panel_means, f"{NOTEBOOK_PREFIX}_panel_four_cell_means.csv")

fig, ax = plt.subplots(figsize=(9, 5))
sns.pointplot(
    data=panel_means,
    x="period",
    y="mean",
    hue="treated",
    order=["pre", "post"],
    markers=["o", "s"],
    linestyles=["-", "-"],
    dodge=0.05,
    errorbar=None,
    ax=ax,
)
ax.set_title("Observed Panel Outcomes By Group And Period")
ax.set_xlabel("Period")
ax.set_ylabel("Mean Outcome")
ax.legend(title="Treated group")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_panel_four_cell_means.png", dpi=160, bbox_inches="tight")
plt.show()

display(panel_means)

treated period mean se n
0 0 post 3.996322 0.026942 4067
1 0 pre 3.813771 0.022440 4067
2 1 post 5.974763 0.032450 3933
3 1 pre 4.428733 0.022632 3933

The treated group starts higher and grows more. DID asks whether the extra growth is treatment impact or whether treated units would have grown faster anyway. The synthetic truth shows that part of the extra growth reflects pre-treatment covariate composition rather than treatment.

Regression Baselines

A useful baseline is an OLS regression of the outcome change on the treatment-group indicator and covariates. This is not the same as the orthogonal DoubleML score, but it gives a transparent reference point.

formula = "delta_weekly_value ~ treated + " + " + ".join(feature_cols)
ols_panel = smf.ols(formula=formula, data=panel_df).fit(cov_type="HC3")

ols_panel_summary = pd.DataFrame(
    [
        {
            "estimator": "OLS on outcome change with covariates",
            "design": "panel",
            "theta_hat": float(ols_panel.params["treated"]),
            "std_error": float(ols_panel.bse["treated"]),
            "ci_95_lower": float(ols_panel.conf_int().loc["treated", 0]),
            "ci_95_upper": float(ols_panel.conf_int().loc["treated", 1]),
            "true_target": TRUE_PANEL_ATT,
            "bias_vs_target": float(ols_panel.params["treated"] - TRUE_PANEL_ATT),
        }
    ]
)
save_table(ols_panel_summary, f"{NOTEBOOK_PREFIX}_panel_ols_baseline.csv")
display(ols_panel_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 OLS on outcome change with covariates panel 1.075332 0.016391 1.043206 1.107457 1.160608 -0.085276

The covariate-adjusted regression moves closer to the target than raw DID. DoubleML will use a related adjustment idea, but with cross-fitted nuisance models and an orthogonal score that is less sensitive to regularization error.

Manual Cross-Fitted DID Score

Before using the package estimator, we can compute a small manual version of the observational panel DID score. The control outcome-change model estimates what treated units would have experienced without treatment, while the propensity model reweights controls to look more like the treated group.

This manual implementation is included for intuition. The DoubleML estimator should be preferred for serious work because it handles score details, inference, sample splitting, and repeated splits consistently.

def crossfit_panel_dr_did(df, y_col, d_col, x_cols, ml_g, ml_m, n_splits=5, seed=RANDOM_SEED):
    X = df[x_cols].reset_index(drop=True)
    y = df[y_col].to_numpy()
    d = df[d_col].to_numpy()
    n = len(df)

    g0_hat = np.full(n, np.nan)
    m_hat = np.full(n, np.nan)
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for train_idx, test_idx in splitter.split(X, d):
        control_train_idx = train_idx[d[train_idx] == 0]

        g_model = clone(ml_g)
        m_model = clone(ml_m)

        # The untreated trend model is trained only on controls.
        g_model.fit(X.iloc[control_train_idx], y[control_train_idx])
        g0_hat[test_idx] = g_model.predict(X.iloc[test_idx])

        # The propensity model is trained on all training observations.
        m_model.fit(X.iloc[train_idx], d[train_idx])
        m_hat[test_idx] = predict_probability(m_model, X.iloc[test_idx])

    m_hat = clip_probability(m_hat)
    treated_share = d.mean()
    control_weight = (1 - d) * m_hat / (1 - m_hat)
    normalized_control_weight = control_weight / control_weight.mean()

    score_contribution = (d / treated_share - normalized_control_weight) * (y - g0_hat)
    theta_hat = score_contribution.mean()
    std_error = score_contribution.std(ddof=1) / np.sqrt(n)

    summary = pd.DataFrame(
        [
            {
                "estimator": "Manual cross-fitted DR panel DID",
                "design": "panel",
                "theta_hat": theta_hat,
                "std_error": std_error,
                "ci_95_lower": theta_hat - 1.96 * std_error,
                "ci_95_upper": theta_hat + 1.96 * std_error,
                "true_target": TRUE_PANEL_ATT,
                "bias_vs_target": theta_hat - TRUE_PANEL_ATT,
                "treated_share": treated_share,
                "mean_control_weight": control_weight.mean(),
            }
        ]
    )
    diagnostics = pd.DataFrame(
        {
            "g0_hat": g0_hat,
            "m_hat": m_hat,
            "score_contribution": score_contribution,
            "treated": d,
            "outcome_change": y,
        }
    )
    return summary, diagnostics

linear_g, linear_m = make_linear_learners()
manual_panel_summary, manual_panel_diagnostics = crossfit_panel_dr_did(
    panel_df,
    y_col="delta_weekly_value",
    d_col="treated",
    x_cols=feature_cols,
    ml_g=linear_g,
    ml_m=linear_m,
    n_splits=5,
    seed=RANDOM_SEED,
)
save_table(manual_panel_summary, f"{NOTEBOOK_PREFIX}_manual_panel_dr_did.csv")
save_table(manual_panel_diagnostics.describe().reset_index(), f"{NOTEBOOK_PREFIX}_manual_panel_diagnostics.csv")
display(manual_panel_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target treated_share mean_control_weight
0 Manual cross-fitted DR panel DID panel 1.170154 0.022568 1.125921 1.214386 1.160608 0.009546 0.491625 0.495287

The manual score lands near the true ATT and gives us a bridge to DoubleML. The score contribution column is also useful later because it shows how individual residualized observations combine into the final estimate.
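
A direct consequence of the score construction is worth verifying: the point estimate is exactly the mean of the per-observation contributions (a one-line check on the manual implementation).

# theta_hat was defined as score_contribution.mean(), so these two numbers must match.
print(manual_panel_diagnostics["score_contribution"].mean())
print(manual_panel_summary["theta_hat"].iloc[0])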

Manual Score Diagnostics

The next plot checks the two pieces that drive the manual score: estimated propensities and score contributions. Extreme propensities or heavy-tailed score contributions would be a warning sign.

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.histplot(
    data=manual_panel_diagnostics,
    x="m_hat",
    hue="treated",
    bins=40,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=axes[0],
)
axes[0].set_title("Manual Cross-Fitted Propensity Estimates")
axes[0].set_xlabel("m_hat(X) = P(D=1 | X)")
axes[0].set_ylabel("Density")

sns.histplot(
    data=manual_panel_diagnostics,
    x="score_contribution",
    bins=60,
    color="#2563eb",
    ax=axes[1],
)
axes[1].axvline(manual_panel_summary["theta_hat"].iloc[0], color="#b91c1c", linestyle="--", label="Mean score")
axes[1].set_title("Manual DID Score Contributions")
axes[1].set_xlabel("Score Contribution")
axes[1].set_ylabel("Count")
axes[1].legend()

plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_manual_panel_score_diagnostics.png", dpi=160, bbox_inches="tight")
plt.show()

The propensity estimates overlap across groups, and the score distribution is centered on the manual estimate, with no small set of observations dominating it. That makes the full DoubleML fit more trustworthy as a teaching example.

DoubleML Panel Data Backend

DoubleMLDIDData tells DoubleML which column is the outcome change, which column marks the treated group, and which columns are pre-treatment covariates. For panel DID, we do not pass a time column because the outcome has already been collapsed to a before-after difference.

panel_dml_data = DoubleMLDIDData(
    panel_df,
    y_col="delta_weekly_value",
    d_cols="treated",
    x_cols=feature_cols,
)

panel_backend_summary = pd.DataFrame(
    [
        {"property": "n_obs", "value": panel_dml_data.n_obs},
        {"property": "outcome", "value": panel_dml_data.y_col},
        {"property": "treatment", "value": ", ".join(panel_dml_data.d_cols)},
        {"property": "n_covariates", "value": len(panel_dml_data.x_cols)},
        {"property": "covariates", "value": ", ".join(panel_dml_data.x_cols)},
    ]
)
save_table(panel_backend_summary, f"{NOTEBOOK_PREFIX}_panel_doublemldata_backend.csv")
display(panel_backend_summary)
property value
0 n_obs 8000
1 outcome delta_weekly_value
2 treatment treated
3 n_covariates 6
4 covariates engagement_score, recent_activity, content_fit...

This backend object is the contract between the dataset and the estimator. If any role is wrong here, the estimator can run successfully while answering the wrong causal question.
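
Cheap guards against role mistakes are worth an extra cell (illustrative assertions, not package requirements):

# The DID outcome must be the change column, and the level outcomes must not
# leak into the covariate set.
assert panel_dml_data.y_col == "delta_weekly_value"
assert "y_pre" not in panel_dml_data.x_cols and "y_post" not in panel_dml_data.x_cols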

Fit DoubleMLDID

We fit three panel DID models:

  • observational score with linear nuisance learners;
  • observational score with gradient boosting nuisance learners;
  • experimental score with gradient boosting outcome learners.

The observational score is appropriate when treated-group membership depends on covariates. The experimental score is included as a contrast: it is what we would use if group assignment were as-if randomized with respect to pre-treatment covariates.

panel_model_rows = []
panel_models = {}

ml_g_linear, ml_m_linear = make_linear_learners()
panel_did_linear = DoubleMLDID(
    panel_dml_data,
    ml_g=ml_g_linear,
    ml_m=ml_m_linear,
    n_folds=5,
    score="observational",
)
panel_did_linear.fit(store_predictions=True)
panel_models["Panel observational linear"] = panel_did_linear
panel_model_rows.append(summarize_doubleml(panel_did_linear, "Panel observational linear", "panel", TRUE_PANEL_ATT))

ml_g_hgb, ml_m_hgb = make_hgb_learners(seed=RANDOM_SEED + 7, max_iter=90)
panel_did_hgb = DoubleMLDID(
    panel_dml_data,
    ml_g=ml_g_hgb,
    ml_m=ml_m_hgb,
    n_folds=5,
    score="observational",
)
panel_did_hgb.fit(store_predictions=True)
panel_models["Panel observational HGB"] = panel_did_hgb
panel_model_rows.append(summarize_doubleml(panel_did_hgb, "Panel observational HGB", "panel", TRUE_PANEL_ATT))

ml_g_exp, _ = make_hgb_learners(seed=RANDOM_SEED + 17, max_iter=90)
panel_did_experimental = DoubleMLDID(
    panel_dml_data,
    ml_g=ml_g_exp,
    n_folds=5,
    score="experimental",
)
panel_did_experimental.fit(store_predictions=True)
panel_models["Panel experimental HGB"] = panel_did_experimental
panel_model_rows.append(summarize_doubleml(panel_did_experimental, "Panel experimental HGB", "panel", TRUE_PANEL_ATT))

panel_doubleml_summary = pd.DataFrame(panel_model_rows)
save_table(panel_doubleml_summary, f"{NOTEBOOK_PREFIX}_panel_doubleml_did_summary.csv")
display(panel_doubleml_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 Panel observational linear panel 1.171624 0.018217 1.135920 1.207328 1.160608 0.011016
1 Panel observational HGB panel 1.171300 0.018574 1.134895 1.207704 1.160608 0.010692
2 Panel experimental HGB panel 1.087835 0.015517 1.057423 1.118247 1.160608 -0.072772

The observational DoubleML estimates are close to the true ATT. The experimental score is less appropriate for this simulated design because treated-group membership was not randomized; the comparison is a reminder that the score choice should match the assignment story.

Panel Estimate Comparison

The next figure puts raw DID, OLS adjustment, the manual cross-fitted score, and DoubleML estimates on one scale. A plot like this is useful in applied notebooks because it makes the consequence of adjustment visible.

panel_estimates = pd.concat(
    [
        raw_panel_summary.assign(std_error=np.nan, ci_95_lower=np.nan, ci_95_upper=np.nan),
        ols_panel_summary,
        manual_panel_summary.drop(columns=["treated_share", "mean_control_weight"]),
        panel_doubleml_summary,
    ],
    ignore_index=True,
)

# Add approximate intervals for the raw estimate using a simple group-difference standard error.
raw_group_var = panel_df.groupby("treated")["delta_weekly_value"].agg(["var", "size"])
raw_se = np.sqrt(raw_group_var.loc[1, "var"] / raw_group_var.loc[1, "size"] + raw_group_var.loc[0, "var"] / raw_group_var.loc[0, "size"])
panel_estimates.loc[panel_estimates["estimator"] == "Raw panel DID", "std_error"] = raw_se
panel_estimates.loc[panel_estimates["estimator"] == "Raw panel DID", "ci_95_lower"] = raw_panel_estimate - 1.96 * raw_se
panel_estimates.loc[panel_estimates["estimator"] == "Raw panel DID", "ci_95_upper"] = raw_panel_estimate + 1.96 * raw_se

save_table(panel_estimates, f"{NOTEBOOK_PREFIX}_panel_estimate_comparison.csv")
plot_panel_estimates = add_ci_columns(panel_estimates)

fig, ax = plt.subplots(figsize=(11, 6))
order = plot_panel_estimates.sort_values("theta_hat", ascending=False)["estimator"]
sns.pointplot(
    data=plot_panel_estimates,
    y="estimator",
    x="theta_hat",
    order=order,
    linestyle="none",
    color="#111827",
    errorbar=None,
    ax=ax,
)
for y_pos, (_, row) in enumerate(plot_panel_estimates.set_index("estimator").loc[order].reset_index().iterrows()):
    ax.errorbar(
        x=row["theta_hat"],
        y=y_pos,
        xerr=[[row["lower_error"]], [row["upper_error"]]],
        fmt="none",
        color="#475569",
        capsize=3,
        linewidth=1.2,
    )
ax.axvline(TRUE_PANEL_ATT, color="#b91c1c", linestyle="--", linewidth=1.5, label="True panel ATT")
ax.axvline(TRUE_PANEL_ATE, color="#2563eb", linestyle="--", linewidth=1.5, label="True panel ATE")
ax.set_title("Panel DID Estimate Comparison")
ax.set_xlabel("Estimated Effect")
ax.set_ylabel("")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_panel_estimate_comparison.png", dpi=160, bbox_inches="tight")
plt.show()

display(panel_estimates)

estimator design theta_hat true_target bias_vs_target std_error ci_95_lower ci_95_upper
0 Raw panel DID panel 1.363479 1.160608 0.202871 0.019195 1.325856 1.401101
1 OLS on outcome change with covariates panel 1.075332 1.160608 -0.085276 0.016391 1.043206 1.107457
2 Manual cross-fitted DR panel DID panel 1.170154 1.160608 0.009546 0.022568 1.125921 1.214386
3 Panel observational linear panel 1.171624 1.160608 0.011016 0.018217 1.135920 1.207328
4 Panel observational HGB panel 1.171300 1.160608 0.010692 0.018574 1.134895 1.207704
5 Panel experimental HGB panel 1.087835 1.160608 -0.072772 0.015517 1.057423 1.118247

The raw panel DID estimate is the outlier because it mixes treatment effects with covariate-driven trend differences. The adjusted estimates are much closer to the true treated-group effect.

Panel Nuisance Learner Diagnostics

DoubleML relies on nuisance learners, so we should inspect them. The loss table below is not a causal validity test. It tells us whether the predictive components are behaving reasonably.

panel_loss_tables = []
for label, model in panel_models.items():
    panel_loss_tables.append(learner_loss_table(model, label))

panel_nuisance_losses = pd.concat(panel_loss_tables, ignore_index=True)
save_table(panel_nuisance_losses, f"{NOTEBOOK_PREFIX}_panel_nuisance_losses.csv")
display(panel_nuisance_losses)
model learner metric_value metric
0 Panel observational linear ml_g0 0.648184 RMSE for outcome learners; classification erro...
1 Panel observational linear ml_g1 0.662279 RMSE for outcome learners; classification erro...
2 Panel observational linear ml_m 0.473772 RMSE for outcome learners; classification erro...
3 Panel observational HGB ml_g0 0.654899 RMSE for outcome learners; classification erro...
4 Panel observational HGB ml_g1 0.679874 RMSE for outcome learners; classification erro...
5 Panel observational HGB ml_m 0.477452 RMSE for outcome learners; classification erro...
6 Panel experimental HGB ml_g0 0.658253 RMSE for outcome learners; classification erro...
7 Panel experimental HGB ml_g1 0.679843 RMSE for outcome learners; classification erro...

The outcome nuisance losses are on the scale of outcome-change prediction error. The propensity learner score is a classification-style loss. These diagnostics are useful for debugging, but good prediction is not a substitute for parallel trends.

Panel Nuisance Predictions Against Synthetic Truth

Because this is a synthetic notebook, we can compare estimated propensities to the true propensities. In real data, we would not have the true propensity, but we can still inspect the distribution of predicted propensities and whether the model is producing extreme weights.

panel_hgb_predictions = pd.DataFrame(
    {
        "true_propensity": panel_df["true_propensity"],
        "m_hat_hgb": prediction_vector(panel_did_hgb, "ml_m"),
        "g0_hat_hgb": prediction_vector(panel_did_hgb, "ml_g0"),
        "g1_hat_hgb": prediction_vector(panel_did_hgb, "ml_g1"),
        "treated": panel_df["treated"],
        "delta_weekly_value": panel_df["delta_weekly_value"],
        "expected_untreated_trend": panel_df["expected_untreated_trend"],
    }
)

panel_nuisance_quality = pd.DataFrame(
    [
        {
            "quantity": "propensity m_hat",
            "rmse_vs_truth": rmse(panel_hgb_predictions["true_propensity"], panel_hgb_predictions["m_hat_hgb"]),
            "mae_vs_truth": mean_absolute_error(panel_hgb_predictions["true_propensity"], panel_hgb_predictions["m_hat_hgb"]),
        },
        {
            "quantity": "untreated trend g0_hat",
            "rmse_vs_truth": rmse(panel_hgb_predictions["expected_untreated_trend"], panel_hgb_predictions["g0_hat_hgb"]),
            "mae_vs_truth": mean_absolute_error(panel_hgb_predictions["expected_untreated_trend"], panel_hgb_predictions["g0_hat_hgb"]),
        },
    ]
)
save_table(panel_nuisance_quality, f"{NOTEBOOK_PREFIX}_panel_hgb_nuisance_quality.csv")
display(panel_nuisance_quality)
quantity rmse_vs_truth mae_vs_truth
0 propensity m_hat 0.061934 0.048002
1 untreated trend g0_hat 0.125982 0.097855

The synthetic truth check shows that the nuisance learners are capturing the main structure. This gives us confidence that any remaining difference from the true ATT is mostly sampling noise rather than a broken implementation.

Panel Propensity Plot

The plot below uses the HGB propensity predictions produced during DoubleML cross-fitting. We want treated and control units to overlap in predicted propensity space.

fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
    data=panel_hgb_predictions,
    x="m_hat_hgb",
    hue="treated",
    bins=45,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=ax,
)
ax.axvline(0.01, color="#991b1b", linestyle="--", linewidth=1, label="Clipping boundary")
ax.axvline(0.99, color="#991b1b", linestyle="--", linewidth=1)
ax.set_title("DoubleML Panel DID Estimated Propensity Overlap")
ax.set_xlabel("m_hat(X) = P(D=1 | X)")
ax.set_ylabel("Density")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_panel_estimated_propensity_overlap.png", dpi=160, bbox_inches="tight")
plt.show()

The mass is away from 0 and 1, and both groups overlap across much of the range. That supports stable weighting in this synthetic example.
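
The visual impression can be backed with a number: the share of predictions at or beyond the clipping boundaries should be near zero (0.01 and 0.99 mirror the clip_probability helper above).

boundary_share = (
    (panel_hgb_predictions["m_hat_hgb"] <= 0.01) | (panel_hgb_predictions["m_hat_hgb"] >= 0.99)
).mean()
print(f"Share of estimated propensities at or beyond the clipping boundary: {boundary_share:.4f}")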

Bootstrap Confidence Interval

DoubleML provides a multiplier bootstrap. Here we bootstrap the HGB observational panel model to show how uncertainty can be reported beyond the default analytic standard error.

panel_did_hgb.bootstrap(method="normal", n_rep_boot=500)
panel_bootstrap_ci = panel_did_hgb.confint(level=0.95).reset_index().rename(columns={"index": "treatment"})
panel_bootstrap_ci["estimator"] = "Panel observational HGB bootstrap CI"
save_table(panel_bootstrap_ci, f"{NOTEBOOK_PREFIX}_panel_hgb_bootstrap_ci.csv")
display(panel_bootstrap_ci)
treatment 2.5 % 97.5 % estimator
0 treated 1.134895 1.207704 Panel observational HGB bootstrap CI

The bootstrap interval is another uncertainty view for the same target. It does not address hidden violations of parallel trends; it quantifies sampling uncertainty under the model and design assumptions.
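
With a single treatment the bootstrap mainly re-expresses the analytic interval, but the same fitted object also supports joint confidence intervals, which matter for multi-treatment models (the joint option requires a prior bootstrap call, which already happened above):

display(panel_did_hgb.confint(level=0.95, joint=True))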

Repeated Sample Splitting

Cross-fitting uses random folds. A robust tutorial should show whether the estimate changes materially when the folds change. Large instability would suggest that the nuisance learners or sample size need attention.

repeated_rows = []
for split_seed in range(6):
    ml_g_rep, ml_m_rep = make_hgb_learners(seed=RANDOM_SEED + 100 + split_seed, max_iter=70)
    did_rep = DoubleMLDID(
        panel_dml_data,
        ml_g=ml_g_rep,
        ml_m=ml_m_rep,
        n_folds=5,
        score="observational",
    )
    did_rep.fit(store_predictions=False)
    repeated_rows.append(
        {
            "split_seed": split_seed,
            "theta_hat": float(did_rep.coef[0]),
            "std_error": float(did_rep.se[0]),
            "true_target": TRUE_PANEL_ATT,
            "bias_vs_target": float(did_rep.coef[0] - TRUE_PANEL_ATT),
        }
    )

panel_split_stability = pd.DataFrame(repeated_rows)
save_table(panel_split_stability, f"{NOTEBOOK_PREFIX}_panel_repeated_split_stability.csv")

fig, ax = plt.subplots(figsize=(9, 4.8))
sns.lineplot(data=panel_split_stability, x="split_seed", y="theta_hat", marker="o", ax=ax)
ax.axhline(TRUE_PANEL_ATT, color="#b91c1c", linestyle="--", label="True panel ATT")
ax.set_title("Panel DID Estimate Across Sample Splits")
ax.set_xlabel("Split Seed")
ax.set_ylabel("Estimated Effect")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_panel_repeated_split_stability.png", dpi=160, bbox_inches="tight")
plt.show()

display(panel_split_stability)

split_seed theta_hat std_error true_target bias_vs_target
0 0 1.174718 0.018313 1.160608 0.014110
1 1 1.170230 0.018376 1.160608 0.009622
2 2 1.166434 0.018265 1.160608 0.005826
3 3 1.167419 0.018584 1.160608 0.006812
4 4 1.166064 0.018253 1.160608 0.005457
5 5 1.173163 0.018378 1.160608 0.012555

The estimates move slightly across folds, but the movement is small relative to the treatment effect. That is the kind of stability we want before presenting a final estimate.
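
One number summarizes the plot: the range of estimates across splits relative to a typical standard error (a rough stability ratio, not a formal test).

split_range = panel_split_stability["theta_hat"].max() - panel_split_stability["theta_hat"].min()
typical_se = panel_split_stability["std_error"].mean()
print(f"Estimate range across splits: {split_range:.4f} ({split_range / typical_se:.2f} standard errors)")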

Placebo Outcome Check

A common DID diagnostic is a placebo pre-trend check. This design has only one pre period, so a real pre-trend check would need additional historical periods. In this synthetic tutorial, we instead create a fake pre-period change that carries no treatment effect. A credible adjusted DID procedure should estimate something close to zero on that placebo outcome.

placebo_panel_df = panel_df.copy()
placebo_panel_df["placebo_pre_delta"] = (
    0.15
    + 0.20 * placebo_panel_df["engagement_score"]
    + 0.10 * placebo_panel_df["recent_activity"]
    - 0.06 * placebo_panel_df["price_sensitivity"]
    + rng.normal(0, 0.55, len(placebo_panel_df))
)

placebo_raw_estimate, _ = raw_panel_did(placebo_panel_df, y_col="placebo_pre_delta", d_col="treated")
placebo_data = DoubleMLDIDData(placebo_panel_df, y_col="placebo_pre_delta", d_cols="treated", x_cols=feature_cols)
ml_g_placebo, ml_m_placebo = make_linear_learners()
placebo_did = DoubleMLDID(
    placebo_data,
    ml_g=ml_g_placebo,
    ml_m=ml_m_placebo,
    n_folds=5,
    score="observational",
)
placebo_did.fit(store_predictions=True)

placebo_summary = pd.DataFrame(
    [
        {"check": "raw placebo DID", "theta_hat": placebo_raw_estimate, "expected_value": 0.0},
        {"check": "DoubleML adjusted placebo DID", "theta_hat": float(placebo_did.coef[0]), "expected_value": 0.0},
    ]
)
placebo_summary["distance_from_zero"] = placebo_summary["theta_hat"].abs()
save_table(placebo_summary, f"{NOTEBOOK_PREFIX}_panel_placebo_check.csv")
display(placebo_summary)
check theta_hat expected_value distance_from_zero
0 raw placebo DID 0.148949 0.0 0.148949
1 DoubleML adjusted placebo DID -0.006641 0.0 0.006641

The adjusted placebo estimate is much closer to zero than the raw placebo contrast. In real work, a clearly nonzero placebo estimate would not automatically invalidate the design, but it would demand a serious explanation.
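
One way to formalize "close to zero" is to ask whether the placebo confidence interval covers zero. A minimal sketch using the fitted placebo model:

# The adjusted placebo interval should comfortably cover zero if conditioning
# on observed covariates captures the pre-period group differences.
placebo_ci = placebo_did.confint(level=0.95)
print(placebo_ci)
covers_zero = bool(placebo_ci.iloc[0, 0] <= 0.0 <= placebo_ci.iloc[0, 1])
print(f"Placebo 95% CI covers zero: {covers_zero}")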

Parallel Trend Violation Stress Test

Now we intentionally break the DID assumption by adding an unobserved treated-group trend shock. This shock is not explained by X, so neither OLS nor DoubleML can fix it. The point is to show the boundary of machine learning: nuisance adjustment helps with observed confounding, not hidden trend violations.

violation_rows = []
for violation_strength in [0.0, 0.10, 0.20, 0.35, 0.50]:
    stress_df = panel_df.copy()
    # Add a constant unobserved post-period shock to treated units only; it is
    # independent of X, so no covariate adjustment can remove it.
    stress_df["delta_with_hidden_trend"] = stress_df["delta_weekly_value"] + violation_strength * stress_df["treated"]
    stress_data = DoubleMLDIDData(
        stress_df,
        y_col="delta_with_hidden_trend",
        d_cols="treated",
        x_cols=feature_cols,
    )
    ml_g_stress, ml_m_stress = make_linear_learners()
    stress_model = DoubleMLDID(
        stress_data,
        ml_g=ml_g_stress,
        ml_m=ml_m_stress,
        n_folds=5,
        score="observational",
    )
    stress_model.fit(store_predictions=False)
    raw_stress, _ = raw_panel_did(stress_df, y_col="delta_with_hidden_trend", d_col="treated")
    violation_rows.append(
        {
            "hidden_trend_added_to_treated": violation_strength,
            "raw_did": raw_stress,
            "doubleml_did": float(stress_model.coef[0]),
            "true_treatment_att": TRUE_PANEL_ATT,
            "expected_biased_target": TRUE_PANEL_ATT + violation_strength,
        }
    )

parallel_trend_stress = pd.DataFrame(violation_rows)
save_table(parallel_trend_stress, f"{NOTEBOOK_PREFIX}_parallel_trend_violation_stress.csv")

fig, ax = plt.subplots(figsize=(9, 5))
plot_stress = parallel_trend_stress.melt(
    id_vars=["hidden_trend_added_to_treated", "true_treatment_att", "expected_biased_target"],
    value_vars=["raw_did", "doubleml_did"],
    var_name="estimator",
    value_name="estimate",
)
sns.lineplot(
    data=plot_stress,
    x="hidden_trend_added_to_treated",
    y="estimate",
    hue="estimator",
    marker="o",
    ax=ax,
)
ax.plot(
    parallel_trend_stress["hidden_trend_added_to_treated"],
    parallel_trend_stress["expected_biased_target"],
    color="#6b7280",
    linestyle="--",
    label="ATT plus hidden trend",
)
ax.axhline(TRUE_PANEL_ATT, color="#b91c1c", linestyle=":", label="True treatment ATT")
ax.set_title("Stress Test: Hidden Treated-Group Trend")
ax.set_xlabel("Hidden Trend Added To Treated Group")
ax.set_ylabel("Estimated Effect")
ax.legend()
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_parallel_trend_violation_stress.png", dpi=160, bbox_inches="tight")
plt.show()

display(parallel_trend_stress)

hidden_trend_added_to_treated raw_did doubleml_did true_treatment_att expected_biased_target
0 0.00 1.363479 1.171459 1.160608 1.160608
1 0.10 1.463479 1.271211 1.160608 1.260608
2 0.20 1.563479 1.370323 1.160608 1.360608
3 0.35 1.713479 1.520546 1.160608 1.510608
4 0.50 1.863479 1.670620 1.160608 1.660608

As the hidden trend violation grows, the adjusted DID estimate follows the biased target rather than the true treatment effect. This is the central limitation of DID: the model cannot discover the missing counterfactual trend by itself.
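
The table makes that failure mode quantifiable: the adjusted estimate tracks TRUE_PANEL_ATT plus the violation strength, not TRUE_PANEL_ATT itself. A short check over the stress results above confirms it:

# Gap to the biased target stays small at every violation strength, while the
# gap to the true ATT grows roughly one-for-one with the hidden trend.
check = parallel_trend_stress.assign(
    gap_to_biased_target=(parallel_trend_stress["doubleml_did"] - parallel_trend_stress["expected_biased_target"]).abs(),
    gap_to_true_att=(parallel_trend_stress["doubleml_did"] - parallel_trend_stress["true_treatment_att"]).abs(),
)
print(check[["hidden_trend_added_to_treated", "gap_to_biased_target", "gap_to_true_att"]])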

Repeated Cross-Section Data

Now we switch to a repeated cross-section design. We no longer observe the same unit twice. Each row is one observation in either the pre or post period.

The group indicator treated marks whether an observation belongs to the group that is exposed after the policy begins. The time indicator post marks whether the observation is from the post period. Actual exposure is again treated * post.

rng_cs = np.random.default_rng(RANDOM_SEED + 1)
N_CS = 10_000

cs_post = rng_cs.binomial(1, 0.50, N_CS)
cs_engagement = rng_cs.normal(0, 1, N_CS)
cs_activity = rng_cs.normal(0, 1, N_CS)
cs_fit = rng_cs.normal(0, 1, N_CS)
cs_price = rng_cs.normal(0, 1, N_CS)
cs_tenure = rng_cs.uniform(0, 1, N_CS)
cs_mobile = rng_cs.binomial(1, 0.55, N_CS)

cs_x = pd.DataFrame(
    {
        "engagement_score": cs_engagement,
        "recent_activity": cs_activity,
        "content_fit": cs_fit,
        "price_sensitivity": cs_price,
        "tenure_signal": cs_tenure,
        "mobile_user": cs_mobile,
    }
)
cs_propensity = sigmoid(-0.10 + 0.50 * cs_engagement + 0.35 * cs_activity - 0.35 * cs_price + 0.15 * cs_mobile)
cs_treated = rng_cs.binomial(1, cs_propensity)

cs_baseline = 4.0 + 0.80 * cs_engagement + 0.50 * cs_activity + 0.70 * cs_fit - 0.40 * cs_price + 0.20 * cs_mobile
cs_expected_trend = 0.25 + 0.25 * cs_engagement + 0.18 * cs_activity + 0.12 * cs_fit - 0.08 * cs_price + 0.05 * cs_mobile
cs_tau = 1.00 + 0.35 * cs_engagement + 0.25 * cs_fit - 0.15 * cs_price + 0.10 * cs_mobile

# Outcome level: baseline, plus a covariate-driven no-treatment trend in the
# post period, plus a treatment effect only where treated and post coincide.
cs_weekly_value = cs_baseline + cs_expected_trend * cs_post + cs_tau * cs_treated * cs_post + rng_cs.normal(0, 0.65, N_CS)

cs_df = pd.concat(
    [
        pd.DataFrame({"row_id": np.arange(N_CS)}),
        cs_x,
        pd.DataFrame(
            {
                "treated": cs_treated,
                "post": cs_post,
                "actual_exposure": cs_treated * cs_post,
                "weekly_value": cs_weekly_value,
                "expected_untreated_trend": cs_expected_trend,
                "true_tau": cs_tau,
                "true_propensity": cs_propensity,
            }
        ),
    ],
    axis=1,
)

TRUE_CS_ATT = cs_df.loc[cs_df["treated"] == 1, "true_tau"].mean()
TRUE_CS_ATE = cs_df["true_tau"].mean()

cs_path = DATASET_DIR / f"{NOTEBOOK_PREFIX}_synthetic_repeated_cs_did.csv"
cs_df.to_csv(cs_path, index=False)

print(f"Repeated cross-section rows: {len(cs_df):,}")
print(f"True repeated cross-section ATT: {TRUE_CS_ATT:.4f}")
print(f"True repeated cross-section ATE: {TRUE_CS_ATE:.4f}")
print(f"Saved repeated cross-section data to: {cs_path.relative_to(PROJECT_ROOT)}")
display(cs_df.head())
Repeated cross-section rows: 10,000
True repeated cross-section ATT: 1.1592
True repeated cross-section ATE: 1.0478
Saved repeated cross-section data to: notebooks/tutorials/doubleml/outputs/datasets/07_synthetic_repeated_cs_did.csv
row_id engagement_score recent_activity content_fit price_sensitivity tenure_signal mobile_user treated post actual_exposure weekly_value expected_untreated_trend true_tau true_propensity
0 0 0.367609 0.424735 -1.386468 0.482572 0.317388 1 1 1 1 4.052319 0.263373 0.809660 0.553188
1 1 1.716114 1.940829 -1.712124 0.622154 0.023020 1 1 0 0 5.214719 0.823151 1.179286 0.797314
2 2 -1.491408 0.459025 0.579608 0.075518 0.023211 0 0 0 0 3.721336 0.023284 0.611581 0.329273
3 3 2.219092 0.681663 -0.414447 -0.007063 0.720072 0 0 1 0 6.626428 0.878304 1.674130 0.777404
4 4 -1.688694 -2.586733 1.250442 -0.811638 0.334672 1 0 1 0 3.360712 -0.372801 0.943313 0.195348

The repeated cross-section dataset has no y_pre or y_post columns for the same unit. That is why the estimator has to model outcome levels in the four group-time cells rather than within-unit changes.

Repeated Cross-Section Field Dictionary

The field roles are similar to the panel design, but the time column is now explicit. post must be passed to the DoubleML data backend for repeated cross-section DID.

cs_field_dictionary = pd.DataFrame(
    [
        {"field": "row_id", "role": "identifier", "description": "Unique row in the repeated cross-section sample."},
        {"field": "treated", "role": "group indicator", "description": "Equals 1 for observations from the group exposed in the post period."},
        {"field": "post", "role": "time indicator", "description": "Equals 1 for post-period observations."},
        {"field": "actual_exposure", "role": "realized treatment", "description": "Interaction treated x post."},
        {"field": "weekly_value", "role": "outcome", "description": "Observed outcome level for that row and period."},
        {"field": "feature columns", "role": "covariates", "description": "Observed pre-treatment or baseline covariates used for conditional DID."},
        {"field": "expected_untreated_trend", "role": "simulation truth", "description": "Expected no-treatment trend given observed covariates."},
        {"field": "true_tau", "role": "simulation truth", "description": "Individual post-period treatment effect."},
        {"field": "true_propensity", "role": "simulation truth", "description": "True probability of belonging to the treated group."},
    ]
)
save_table(cs_field_dictionary, f"{NOTEBOOK_PREFIX}_cs_field_dictionary.csv")
display(cs_field_dictionary)
field role description
0 row_id identifier Unique row in the repeated cross-section sample.
1 treated group indicator Equals 1 for observations from the group expos...
2 post time indicator Equals 1 for post-period observations.
3 actual_exposure realized treatment Interaction treated x post.
4 weekly_value outcome Observed outcome level for that row and period.
5 feature columns covariates Observed pre-treatment or baseline covariates ...
6 expected_untreated_trend simulation truth Expected no-treatment trend given observed cov...
7 true_tau simulation truth Individual post-period treatment effect.
8 true_propensity simulation truth True probability of belonging to the treated g...

The important structural difference is that repeated cross-section DID must know both treated and post. The package uses t_col="post" to identify the time indicator.

Repeated Cross-Section Audit And Cell Means

A repeated cross-section DID audit should check the four cells. If one group-time cell is sparse, the design can become unstable.

cs_cell_counts = cs_df.groupby(["treated", "post"]).size().rename("n").reset_index()
cs_raw_estimate, cs_cell_means = raw_repeated_cs_did(cs_df)

cs_audit = pd.DataFrame(
    [
        {"metric": "rows", "value": len(cs_df)},
        {"metric": "missing_cells", "value": int(cs_df.isna().sum().sum())},
        {"metric": "treated_share", "value": cs_df["treated"].mean()},
        {"metric": "post_share", "value": cs_df["post"].mean()},
        {"metric": "raw_repeated_cs_did", "value": cs_raw_estimate},
        {"metric": "true_cs_att", "value": TRUE_CS_ATT},
    ]
)
save_table(cs_cell_counts, f"{NOTEBOOK_PREFIX}_cs_cell_counts.csv")
save_table(cs_cell_means.reset_index(), f"{NOTEBOOK_PREFIX}_cs_cell_means.csv")
save_table(cs_audit, f"{NOTEBOOK_PREFIX}_cs_data_audit.csv")

display(cs_cell_counts)
display(cs_cell_means)
display(cs_audit)
treated post n
0 0 0 2496
1 0 1 2519
2 1 0 2457
3 1 1 2528
post 0 1
treated
0 3.771595 3.953479
1 4.462902 5.939473
metric value
0 rows 10000.000000
1 missing_cells 0.000000
2 treated_share 0.498500
3 post_share 0.504700
4 raw_repeated_cs_did 1.294686
5 true_cs_att 1.159164

All four cells are well populated. The raw repeated cross-section DID estimate is informative, but it remains vulnerable to covariate-driven trends because treated-group membership is observational rather than randomized.
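
The raw contrast can also be rebuilt by hand from the four cell means, which is a useful sanity check on any helper function. A minimal sketch, assuming cs_cell_means is the treated-by-post pivot displayed above:

# (treated post - treated pre) - (control post - control pre)
treated_change = cs_cell_means.loc[1, 1] - cs_cell_means.loc[1, 0]  # about 1.4766
control_change = cs_cell_means.loc[0, 1] - cs_cell_means.loc[0, 0]  # about 0.1819
print(f"Raw repeated CS DID by hand: {treated_change - control_change:.6f}")  # matches cs_raw_estimate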

Repeated Cross-Section Four-Cell Plot

The repeated cross-section plot is the classic DID picture. Each line connects group means from pre to post, even though the underlying units are not the same people over time.

cs_means = (
    cs_df.groupby(["treated", "post"])["weekly_value"]
    .agg(mean="mean", se=lambda x: x.std(ddof=1) / np.sqrt(len(x)), n="size")
    .reset_index()
)
cs_means["period"] = cs_means["post"].map({0: "pre", 1: "post"})
save_table(cs_means, f"{NOTEBOOK_PREFIX}_cs_four_cell_means.csv")

fig, ax = plt.subplots(figsize=(9, 5))
sns.pointplot(
    data=cs_means,
    x="period",
    y="mean",
    hue="treated",
    order=["pre", "post"],
    markers=["o", "s"],
    dodge=0.05,
    errorbar=None,
    ax=ax,
)
ax.set_title("Repeated Cross-Section Outcomes By Group And Period")
ax.set_xlabel("Period")
ax.set_ylabel("Mean Outcome")
ax.legend(title="Treated group")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_cs_four_cell_means.png", dpi=160, bbox_inches="tight")
plt.show()

display(cs_means)

treated post mean se n period
0 0 0 3.771595 0.027247 2496 pre
1 0 1 3.953479 0.033645 2519 post
2 1 0 4.462902 0.027094 2457 pre
3 1 1 5.939473 0.039852 2528 post

The treated group’s post-period jump is visibly larger. The causal question is whether that extra jump is due to treatment exposure or to the treated group having a different covariate composition and natural trend.

Fit DoubleMLDIDCS

For repeated cross sections, the backend receives the outcome level, the treated-group indicator, the time indicator, and covariates. The estimator learns outcome nuisance functions for the relevant group-time cells and a propensity model for treated-group membership.

cs_dml_data = DoubleMLDIDData(
    cs_df,
    y_col="weekly_value",
    d_cols="treated",
    t_col="post",
    x_cols=feature_cols,
)

cs_backend_summary = pd.DataFrame(
    [
        {"property": "n_obs", "value": cs_dml_data.n_obs},
        {"property": "outcome", "value": cs_dml_data.y_col},
        {"property": "treatment", "value": ", ".join(cs_dml_data.d_cols)},
        {"property": "time", "value": cs_dml_data.t_col},
        {"property": "n_covariates", "value": len(cs_dml_data.x_cols)},
    ]
)
save_table(cs_backend_summary, f"{NOTEBOOK_PREFIX}_cs_doublemldata_backend.csv")
display(cs_backend_summary)
property value
0 n_obs 10000
1 outcome weekly_value
2 treatment treated
3 time post
4 n_covariates 6

The time column is the main backend difference from panel DID. In panel DID, time has already been used to create the outcome change. In repeated cross-section DID, time stays in the data object.
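
A side-by-side sketch makes the constructor difference explicit; both calls use objects already built in this notebook:

# Panel: the change column already encodes time, so no t_col is needed.
panel_example = DoubleMLDIDData(panel_df, y_col="delta_weekly_value", d_cols="treated", x_cols=feature_cols)

# Repeated cross section: the outcome is a level, so the time indicator must
# be passed explicitly through t_col.
cs_example = DoubleMLDIDData(cs_df, y_col="weekly_value", d_cols="treated", t_col="post", x_cols=feature_cols)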

Repeated Cross-Section DoubleML Estimates

We fit linear and gradient boosting nuisance versions again. The estimates will not match the panel estimates exactly because this is a different sample and a different data structure.

cs_model_rows = []
cs_models = {}

cs_g_linear, cs_m_linear = make_linear_learners()
cs_did_linear = DoubleMLDIDCS(
    cs_dml_data,
    ml_g=cs_g_linear,
    ml_m=cs_m_linear,
    n_folds=5,
    score="observational",
)
cs_did_linear.fit(store_predictions=True)
cs_models["Repeated CS observational linear"] = cs_did_linear
cs_model_rows.append(summarize_doubleml(cs_did_linear, "Repeated CS observational linear", "repeated cross-section", TRUE_CS_ATT))

cs_g_hgb, cs_m_hgb = make_hgb_learners(seed=RANDOM_SEED + 31, max_iter=90)
cs_did_hgb = DoubleMLDIDCS(
    cs_dml_data,
    ml_g=cs_g_hgb,
    ml_m=cs_m_hgb,
    n_folds=5,
    score="observational",
)
cs_did_hgb.fit(store_predictions=True)
cs_models["Repeated CS observational HGB"] = cs_did_hgb
cs_model_rows.append(summarize_doubleml(cs_did_hgb, "Repeated CS observational HGB", "repeated cross-section", TRUE_CS_ATT))

cs_g_exp, _ = make_hgb_learners(seed=RANDOM_SEED + 41, max_iter=90)
cs_did_experimental = DoubleMLDIDCS(
    cs_dml_data,
    ml_g=cs_g_exp,
    n_folds=5,
    score="experimental",
)
cs_did_experimental.fit(store_predictions=True)
cs_models["Repeated CS experimental HGB"] = cs_did_experimental
cs_model_rows.append(summarize_doubleml(cs_did_experimental, "Repeated CS experimental HGB", "repeated cross-section", TRUE_CS_ATT))

cs_doubleml_summary = pd.DataFrame(cs_model_rows)
save_table(cs_doubleml_summary, f"{NOTEBOOK_PREFIX}_cs_doubleml_did_summary.csv")
display(cs_doubleml_summary)
estimator design theta_hat std_error ci_95_lower ci_95_upper true_target bias_vs_target
0 Repeated CS observational linear repeated cross-section 1.142845 0.031536 1.081036 1.204653 1.159164 -0.016320
1 Repeated CS observational HGB repeated cross-section 1.095756 0.041543 1.014332 1.177179 1.159164 -0.063409
2 Repeated CS experimental HGB repeated cross-section 1.064566 0.028160 1.009373 1.119759 1.159164 -0.094598

The repeated cross-section estimates are close to the true ATT, with some sampling and nuisance-model variation. This section also shows why repeated cross sections usually need more care than panels: we are modeling levels in group-time cells, not differencing the same units.
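
Because the truth is known in this synthetic setting, we can also check whether each 95 percent interval covers the true ATT, a column no real analysis could compute:

# Coverage check against the known synthetic target.
coverage = cs_doubleml_summary.assign(
    ci_covers_true_att=(
        (cs_doubleml_summary["ci_95_lower"] <= TRUE_CS_ATT)
        & (cs_doubleml_summary["ci_95_upper"] >= TRUE_CS_ATT)
    )
)
print(coverage[["estimator", "theta_hat", "ci_covers_true_att"]])

In this run the experimental-score interval misses the target. That is consistent with its bias: the experimental score assumes randomized group membership, and here membership depends on covariates.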

Repeated Cross-Section Nuisance Diagnostics

The repeated cross-section estimator has more outcome nuisance functions because it models outcome levels for combinations of group and time. The learner names in the table encode those cells.

cs_loss_tables = []
for label, model in cs_models.items():
    cs_loss_tables.append(learner_loss_table(model, label))

cs_nuisance_losses = pd.concat(cs_loss_tables, ignore_index=True)
save_table(cs_nuisance_losses, f"{NOTEBOOK_PREFIX}_cs_nuisance_losses.csv")
display(cs_nuisance_losses)
model learner metric_value metric
0 Repeated CS observational linear ml_g_d0_t0 0.643057 RMSE for outcome learners; classification erro...
1 Repeated CS observational linear ml_g_d0_t1 0.666071 RMSE for outcome learners; classification erro...
2 Repeated CS observational linear ml_g_d1_t0 0.636959 RMSE for outcome learners; classification erro...
3 Repeated CS observational linear ml_g_d1_t1 0.648046 RMSE for outcome learners; classification erro...
4 Repeated CS observational linear ml_m 0.472545 RMSE for outcome learners; classification erro...
5 Repeated CS observational HGB ml_g_d0_t0 0.685391 RMSE for outcome learners; classification erro...
6 Repeated CS observational HGB ml_g_d0_t1 0.728344 RMSE for outcome learners; classification erro...
7 Repeated CS observational HGB ml_g_d1_t0 0.688262 RMSE for outcome learners; classification erro...
8 Repeated CS observational HGB ml_g_d1_t1 0.720478 RMSE for outcome learners; classification erro...
9 Repeated CS observational HGB ml_m 0.475287 RMSE for outcome learners; classification erro...
10 Repeated CS experimental HGB ml_g_d0_t0 0.684433 RMSE for outcome learners; classification erro...
11 Repeated CS experimental HGB ml_g_d0_t1 0.725038 RMSE for outcome learners; classification erro...
12 Repeated CS experimental HGB ml_g_d1_t0 0.686441 RMSE for outcome learners; classification erro...
13 Repeated CS experimental HGB ml_g_d1_t1 0.721103 RMSE for outcome learners; classification erro...

The four ml_g rows correspond to outcome models for combinations of treated/control and pre/post. If one group-time cell had poor predictive performance or very few rows, the final DID estimate would deserve extra caution.
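
With store_predictions=True, the fitted models also keep their cross-fitted nuisance predictions under the same names. A minimal sketch inspecting the linear model:

# Keys match the learner names in the loss table, e.g. ml_g_d1_t1 and ml_m.
# Every prediction is out-of-fold for the observation it belongs to.
for name, preds in cs_did_linear.predictions.items():
    print(f"{name}: shape {preds.shape}")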

Combined DID Summary

Now we combine the main panel and repeated cross-section results in one table. This is the table a reader would likely want near the top of a report.

cs_raw_summary = pd.DataFrame(
    [
        {
            "estimator": "Raw repeated CS DID",
            "design": "repeated cross-section",
            "theta_hat": cs_raw_estimate,
            "std_error": np.nan,
            "ci_95_lower": np.nan,
            "ci_95_upper": np.nan,
            "true_target": TRUE_CS_ATT,
            "bias_vs_target": cs_raw_estimate - TRUE_CS_ATT,
        }
    ]
)

combined_did_summary = pd.concat(
    [
        panel_estimates.assign(section="panel"),
        cs_raw_summary.assign(section="repeated cross-section"),
        cs_doubleml_summary.assign(section="repeated cross-section"),
    ],
    ignore_index=True,
    sort=False,
)
save_table(combined_did_summary, f"{NOTEBOOK_PREFIX}_combined_did_summary.csv")
display(combined_did_summary)
estimator design theta_hat true_target bias_vs_target std_error ci_95_lower ci_95_upper section
0 Raw panel DID panel 1.363479 1.160608 0.202871 0.019195 1.325856 1.401101 panel
1 OLS on outcome change with covariates panel 1.075332 1.160608 -0.085276 0.016391 1.043206 1.107457 panel
2 Manual cross-fitted DR panel DID panel 1.170154 1.160608 0.009546 0.022568 1.125921 1.214386 panel
3 Panel observational linear panel 1.171624 1.160608 0.011016 0.018217 1.135920 1.207328 panel
4 Panel observational HGB panel 1.171300 1.160608 0.010692 0.018574 1.134895 1.207704 panel
5 Panel experimental HGB panel 1.087835 1.160608 -0.072772 0.015517 1.057423 1.118247 panel
6 Raw repeated CS DID repeated cross-section 1.294686 1.159164 0.135522 NaN NaN NaN repeated cross-section
7 Repeated CS observational linear repeated cross-section 1.142845 1.159164 -0.016320 0.031536 1.081036 1.204653 repeated cross-section
8 Repeated CS observational HGB repeated cross-section 1.095756 1.159164 -0.063409 0.041543 1.014332 1.177179 repeated cross-section
9 Repeated CS experimental HGB repeated cross-section 1.064566 1.159164 -0.094598 0.028160 1.009373 1.119759 repeated cross-section

The combined table should be read by design, not as a horse race. Panel DID and repeated cross-section DID answer related questions under different data structures.

Combined Estimate Plot

The final comparison plot separates panel and repeated cross-section estimates by color. The target lines differ because the true treated-group effect is computed separately within each synthetic sample.

plot_combined = combined_did_summary.dropna(subset=["theta_hat"]).copy()
plot_combined["label"] = plot_combined["section"] + " | " + plot_combined["estimator"]
plot_combined = add_ci_columns(plot_combined)

fig, ax = plt.subplots(figsize=(12, 7))
order = plot_combined.sort_values(["section", "theta_hat"], ascending=[True, False])["label"]
sns.pointplot(
    data=plot_combined,
    y="label",
    x="theta_hat",
    hue="section",
    order=order,
    linestyle="none",
    dodge=False,
    errorbar=None,
    ax=ax,
)
for y_pos, (_, row) in enumerate(plot_combined.set_index("label").loc[order].reset_index().iterrows()):
    if pd.notna(row["ci_95_lower"]):
        ax.errorbar(
            x=row["theta_hat"],
            y=y_pos,
            xerr=[[row["lower_error"]], [row["upper_error"]]],
            fmt="none",
            color="#475569",
            capsize=3,
            linewidth=1.1,
        )
ax.axvline(TRUE_PANEL_ATT, color="#b91c1c", linestyle="--", linewidth=1.4, label="True panel ATT")
ax.axvline(TRUE_CS_ATT, color="#2563eb", linestyle="--", linewidth=1.4, label="True repeated CS ATT")
ax.set_title("DID Estimates Across Data Structures")
ax.set_xlabel("Estimated Effect")
ax.set_ylabel("")
ax.legend(loc="best")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_combined_did_estimates.png", dpi=160, bbox_inches="tight")
plt.show()

The adjusted estimates cluster near their design-specific true targets. The raw estimates remain useful references, but they should not be treated as the main answer when assignment is observational and covariates predict trends.

What To Report In A DID Analysis

A DID report should not stop at a coefficient. It should tell readers what comparison created the estimate and why the counterfactual trend assumption is credible.

did_reporting_checklist = pd.DataFrame(
    [
        {"topic": "estimand", "question": "Is the target ATT, ATE, or another effect?", "notebook_answer": "ATT for treated-group units in the post period."},
        {"topic": "data structure", "question": "Is the design panel or repeated cross-section?", "notebook_answer": "Both are demonstrated separately."},
        {"topic": "treatment timing", "question": "Is exposure equal to treated group times post period?", "notebook_answer": "Yes. Actual exposure is treated x post."},
        {"topic": "parallel trends", "question": "Why should untreated trends be comparable?", "notebook_answer": "In the synthetic design, untreated trends depend on observed X."},
        {"topic": "overlap", "question": "Do treated and control groups share covariate support?", "notebook_answer": "Propensity plots show adequate overlap."},
        {"topic": "placebo", "question": "Do pre-period placebo checks look near zero?", "notebook_answer": "The adjusted placebo estimate is close to zero."},
        {"topic": "sensitivity", "question": "What happens under hidden trend violations?", "notebook_answer": "Estimates move with the hidden trend, showing the design limit."},
        {"topic": "uncertainty", "question": "Are confidence intervals and split stability reported?", "notebook_answer": "Yes. Analytic, bootstrap, and split-stability outputs are saved."},
    ]
)
save_table(did_reporting_checklist, f"{NOTEBOOK_PREFIX}_did_reporting_checklist.csv")
display(did_reporting_checklist)
topic question notebook_answer
0 estimand Is the target ATT, ATE, or another effect? ATT for treated-group units in the post period.
1 data structure Is the design panel or repeated cross-section? Both are demonstrated separately.
2 treatment timing Is exposure equal to treated group times post ... Yes. Actual exposure is treated x post.
3 parallel trends Why should untreated trends be comparable? In the synthetic design, untreated trends depe...
4 overlap Do treated and control groups share covariate ... Propensity plots show adequate overlap.
5 placebo Do pre-period placebo checks look near zero? The adjusted placebo estimate is close to zero.
6 sensitivity What happens under hidden trend violations? Estimates move with the hidden trend, showing ...
7 uncertainty Are confidence intervals and split stability r... Yes. Analytic, bootstrap, and split-stability ...

This checklist is the bridge from notebook work to credible communication. A reader should be able to tell what would need to be true for the estimate to have a causal reading.

Report Template

The cell below writes a short Markdown report template. The values are filled from the notebook’s main outputs, but the assumption language is deliberately cautious.

best_panel_row = panel_doubleml_summary.loc[panel_doubleml_summary["estimator"] == "Panel observational HGB"].iloc[0]
best_cs_row = cs_doubleml_summary.loc[cs_doubleml_summary["estimator"] == "Repeated CS observational HGB"].iloc[0]

report_text = f"""
# Difference-In-Differences DoubleML Report Template

## Question

Estimate the post-period effect for units in the treated group using DID designs with flexible nuisance adjustment.

## Panel DID Result

The preferred panel estimate is the observational `DoubleMLDID` model with gradient boosting nuisance learners.

- Estimate: {best_panel_row['theta_hat']:.4f}
- Standard error: {best_panel_row['std_error']:.4f}
- 95 percent CI: [{best_panel_row['ci_95_lower']:.4f}, {best_panel_row['ci_95_upper']:.4f}]
- Synthetic true ATT: {TRUE_PANEL_ATT:.4f}

## Repeated Cross-Section DID Result

The preferred repeated cross-section estimate is the observational `DoubleMLDIDCS` model with gradient boosting nuisance learners.

- Estimate: {best_cs_row['theta_hat']:.4f}
- Standard error: {best_cs_row['std_error']:.4f}
- 95 percent CI: [{best_cs_row['ci_95_lower']:.4f}, {best_cs_row['ci_95_upper']:.4f}]
- Synthetic true ATT: {TRUE_CS_ATT:.4f}

## Identification Assumptions

The estimates rely on conditional parallel trends: after conditioning on observed pre-treatment covariates, treated and control groups would have had comparable untreated trends. The estimates also require overlap, stable measurement, and no hidden group-specific shocks aligned with treatment timing.

## Diagnostics To Include

- Covariate balance by treated group.
- Propensity overlap plots.
- Raw DID versus adjusted DID comparison.
- Placebo pre-period check when historical outcomes are available.
- Stress test or sensitivity discussion for hidden trend violations.
- Clear statement of the target population.
""".strip()

report_path = REPORT_DIR / f"{NOTEBOOK_PREFIX}_did_report_template.md"
report_path.write_text(report_text)
print(report_text)
# Difference-In-Differences DoubleML Report Template

## Question

Estimate the post-period effect for units in the treated group using DID designs with flexible nuisance adjustment.

## Panel DID Result

The preferred panel estimate is the observational `DoubleMLDID` model with gradient boosting nuisance learners.

- Estimate: 1.1713
- Standard error: 0.0186
- 95 percent CI: [1.1349, 1.2077]
- Synthetic true ATT: 1.1606

## Repeated Cross-Section DID Result

The preferred repeated cross-section estimate is the observational `DoubleMLDIDCS` model with gradient boosting nuisance learners.

- Estimate: 1.0958
- Standard error: 0.0415
- 95 percent CI: [1.0143, 1.1772]
- Synthetic true ATT: 1.1592

## Identification Assumptions

The estimates rely on conditional parallel trends: after conditioning on observed pre-treatment covariates, treated and control groups would have had comparable untreated trends. The estimates also require overlap, stable measurement, and no hidden group-specific shocks aligned with treatment timing.

## Diagnostics To Include

- Covariate balance by treated group.
- Propensity overlap plots.
- Raw DID versus adjusted DID comparison.
- Placebo pre-period check when historical outcomes are available.
- Stress test or sensitivity discussion for hidden trend violations.
- Clear statement of the target population.

The report template keeps the coefficient tied to the design. That is the habit to build: every causal estimate should travel with its target, assumptions, diagnostics, and limitations.

Artifact Manifest

The final cell lists the files produced by this notebook. This makes the tutorial easy to audit and easy to reuse in a portfolio or report workflow.

artifact_rows = []
for folder, artifact_type in [
    (DATASET_DIR, "dataset"),
    (TABLE_DIR, "table"),
    (FIGURE_DIR, "figure"),
    (REPORT_DIR, "report"),
]:
    for file_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": file_path.name,
                "relative_path": str(file_path.relative_to(PROJECT_ROOT)),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
save_table(artifact_manifest, f"{NOTEBOOK_PREFIX}_artifact_manifest.csv")
display(artifact_manifest)
artifact_type file_name relative_path
0 dataset 07_synthetic_panel_did.csv notebooks/tutorials/doubleml/outputs/datasets/...
1 dataset 07_synthetic_repeated_cs_did.csv notebooks/tutorials/doubleml/outputs/datasets/...
2 table 07_artifact_manifest.csv notebooks/tutorials/doubleml/outputs/tables/07...
3 table 07_combined_did_summary.csv notebooks/tutorials/doubleml/outputs/tables/07...
4 table 07_cs_cell_counts.csv notebooks/tutorials/doubleml/outputs/tables/07...
5 table 07_cs_cell_means.csv notebooks/tutorials/doubleml/outputs/tables/07...
6 table 07_cs_data_audit.csv notebooks/tutorials/doubleml/outputs/tables/07...
7 table 07_cs_doubleml_did_summary.csv notebooks/tutorials/doubleml/outputs/tables/07...
8 table 07_cs_doublemldata_backend.csv notebooks/tutorials/doubleml/outputs/tables/07...
9 table 07_cs_field_dictionary.csv notebooks/tutorials/doubleml/outputs/tables/07...
10 table 07_cs_four_cell_means.csv notebooks/tutorials/doubleml/outputs/tables/07...
11 table 07_cs_nuisance_losses.csv notebooks/tutorials/doubleml/outputs/tables/07...
12 table 07_did_reporting_checklist.csv notebooks/tutorials/doubleml/outputs/tables/07...
13 table 07_manual_panel_diagnostics.csv notebooks/tutorials/doubleml/outputs/tables/07...
14 table 07_manual_panel_dr_did.csv notebooks/tutorials/doubleml/outputs/tables/07...
15 table 07_package_versions.csv notebooks/tutorials/doubleml/outputs/tables/07...
16 table 07_panel_change_by_group.csv notebooks/tutorials/doubleml/outputs/tables/07...
17 table 07_panel_covariate_balance.csv notebooks/tutorials/doubleml/outputs/tables/07...
18 table 07_panel_data_audit.csv notebooks/tutorials/doubleml/outputs/tables/07...
19 table 07_panel_doubleml_did_summary.csv notebooks/tutorials/doubleml/outputs/tables/07...
20 table 07_panel_doublemldata_backend.csv notebooks/tutorials/doubleml/outputs/tables/07...
21 table 07_panel_estimate_comparison.csv notebooks/tutorials/doubleml/outputs/tables/07...
22 table 07_panel_field_dictionary.csv notebooks/tutorials/doubleml/outputs/tables/07...
23 table 07_panel_four_cell_means.csv notebooks/tutorials/doubleml/outputs/tables/07...
24 table 07_panel_hgb_bootstrap_ci.csv notebooks/tutorials/doubleml/outputs/tables/07...
25 table 07_panel_hgb_nuisance_quality.csv notebooks/tutorials/doubleml/outputs/tables/07...
26 table 07_panel_nuisance_losses.csv notebooks/tutorials/doubleml/outputs/tables/07...
27 table 07_panel_ols_baseline.csv notebooks/tutorials/doubleml/outputs/tables/07...
28 table 07_panel_placebo_check.csv notebooks/tutorials/doubleml/outputs/tables/07...
29 table 07_panel_repeated_split_stability.csv notebooks/tutorials/doubleml/outputs/tables/07...
30 table 07_panel_target_summary.csv notebooks/tutorials/doubleml/outputs/tables/07...
31 table 07_parallel_trend_violation_stress.csv notebooks/tutorials/doubleml/outputs/tables/07...
32 table 07_raw_panel_did_summary.csv notebooks/tutorials/doubleml/outputs/tables/07...
33 figure 07_combined_did_estimates.png notebooks/tutorials/doubleml/outputs/figures/0...
34 figure 07_cs_four_cell_means.png notebooks/tutorials/doubleml/outputs/figures/0...
35 figure 07_did_design_dag.png notebooks/tutorials/doubleml/outputs/figures/0...
36 figure 07_manual_panel_score_diagnostics.png notebooks/tutorials/doubleml/outputs/figures/0...
37 figure 07_panel_estimate_comparison.png notebooks/tutorials/doubleml/outputs/figures/0...
38 figure 07_panel_estimated_propensity_overlap.png notebooks/tutorials/doubleml/outputs/figures/0...
39 figure 07_panel_four_cell_means.png notebooks/tutorials/doubleml/outputs/figures/0...
40 figure 07_panel_repeated_split_stability.png notebooks/tutorials/doubleml/outputs/figures/0...
41 figure 07_panel_true_propensity_overlap.png notebooks/tutorials/doubleml/outputs/figures/0...
42 figure 07_parallel_trend_violation_stress.png notebooks/tutorials/doubleml/outputs/figures/0...
43 report 07_did_report_template.md notebooks/tutorials/doubleml/outputs/reports/0...

The DID notebook is complete. The next natural topic is sample-selection models, where the challenge shifts from before-after counterfactual trends to outcomes that are only observed for selected units.