04 - Heterogeneous Treatment Effects

Goal: estimate where top-3 ranking exposure appears to matter more or less.

The previous notebooks estimated an average effect of is_top_3 on clicked. That global estimate is useful, but product teams usually need a more targeted answer:

For which users, items, and contexts does higher ranking position create the most incremental engagement?

This notebook uses doubly robust/AIPW scores to estimate segment-level effects across content categories, user-history buckets, candidate-set-size buckets, item-exposure buckets, and time-of-day buckets.

Why Heterogeneous Effects Matter

A single average treatment effect can hide important product differences. For example, top-3 placement might be very valuable for fresh entertainment content, but less valuable for items that users would click regardless of position. It might also matter more when users face long recommendation lists because lower positions become harder to notice.

For a recommendation data science portfolio project, this notebook is important because it turns the causal estimate into a product decision framework:

Which segments are most position-sensitive?
Which segments have little incremental benefit from top placement?
Where might ranking changes produce the biggest engagement return?
Where do we need more data or better overlap before making claims?

Notebook Setup

This cell imports the libraries used for heterogeneous effect estimation. Most of the workflow is standard pandas/numpy analysis plus scikit-learn nuisance models. We reuse the doubly robust logic from notebook 3 so notebook 4 can run independently.

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

sns.set_theme(style="whitegrid")
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", "{:.4f}".format)

There is no model result yet. The imports and display settings are in place so the following cells can focus on the data and the causal question.

Load The Processed Impression Table

This cell loads the processed MIND-small sample created earlier. Each row is one displayed item inside a recommendation impression. That means each row has a rank position, a click outcome, item metadata, user-history context, and impression context.

DATA_RELATIVE_PATH = Path("data/processed/mind_small_impressions_train_sample.parquet")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / DATA_RELATIVE_PATH).exists()
)

DATA_PATH = PROJECT_ROOT / DATA_RELATIVE_PATH
df = pd.read_parquet(DATA_PATH)

df.shape

(737762, 20)

The shape confirms that the expected processed dataset was loaded: 737,762 displayed-item rows and 20 columns. This check anchors the rest of the analysis, because all treatment, outcome, and covariate definitions depend on these columns being present and correctly typed.

Causal Setup

We keep the same causal setup as notebooks 2 and 3:

Treatment: is_top_3 = 1, item appears in positions 1, 2, or 3.
Control: is_top_3 = 0, item appears below position 3.
Outcome: clicked, whether the displayed item was clicked.
Covariates: observed item, user-history, and context features.

The new question is not only the global average effect. We want segment-specific effects:

E[Y(1) - Y(0) | segment]

For example, this could mean the top-3 effect within sports articles, within low-history users, or within long candidate sets.

Modeling Sample

This notebook recomputes cross-fitted AIPW scores so it can run independently from notebook 3. To keep the notebook interactive, we use a random modeling sample. The sample is large enough for segment analysis, but small enough that cross-fitting remains quick.

Create Treatment, Outcome, And Basic Features

This cell samples rows and creates clean analysis columns. treatment and outcome are explicit versions of is_top_3 and clicked. log_item_exposures is a log-transformed exposure count used as a popularity proxy. The readable treatment_label is used in tables and plots.

MODEL_SAMPLE_SIZE = 150_000
RANDOM_STATE = 42

model_df = (
    df.sample(n=min(len(df), MODEL_SAMPLE_SIZE), random_state=RANDOM_STATE)
    .reset_index(drop=True)
    .copy()
)

model_df["treatment"] = model_df["is_top_3"].astype(int)
model_df["outcome"] = model_df["clicked"].astype(int)
model_df["log_item_exposures"] = np.log1p(model_df["item_exposures"])
model_df["treatment_label"] = np.where(model_df["treatment"] == 1, "top_3", "rank_4_plus")

pd.Series(
    {
        "rows": len(model_df),
        "treatment_rate_top_3": model_df["treatment"].mean(),
        "click_rate": model_df["outcome"].mean(),
        "unique_users": model_df["user_id"].nunique(),
        "unique_items": model_df["news_id"].nunique(),
    }
)

rows                   150000.0000
treatment_rate_top_3        0.0801
click_rate                  0.0396
unique_users            14029.0000
unique_items             6862.0000
dtype: float64

The modeling sample has 150,000 rows. About 8% of them are top-3 rows (roughly 12,000 treated rows) and about 4% are clicks. That is enough for broad segments, but treated counts shrink quickly in narrow segments, which is why the segment helper later applies minimum-count filters.

Nuisance Model Features

AIPW needs a propensity model and an outcome model. We use observed user, item, and context covariates. As before, we avoid item_clicks and item_ctr as model inputs because they are computed from click outcomes in this same sample. They are useful descriptive columns, but not ideal nuisance-model inputs for a causal estimate.

Define Feature Lists

This cell defines the features used by the propensity and outcome models. The propensity model predicts treatment from covariates. The outcome model predicts clicks from treatment plus the same covariates. Numeric and categorical features are listed separately because they get different preprocessing.

numeric_features = [
    "history_len",
    "candidate_set_size",
    "title_length",
    "abstract_length",
    "hour",
    "day_of_week",
    "log_item_exposures",
]
categorical_features = ["category", "subcategory"]

propensity_features = numeric_features + categorical_features
outcome_numeric_features = numeric_features + ["treatment"]
outcome_features = outcome_numeric_features + categorical_features

X = model_df[propensity_features]
t = model_df["treatment"]

propensity_features, outcome_features

(['history_len',
  'candidate_set_size',
  'title_length',
  'abstract_length',
  'hour',
  'day_of_week',
  'log_item_exposures',
  'category',
  'subcategory'],
 ['history_len',
  'candidate_set_size',
  'title_length',
  'abstract_length',
  'hour',
  'day_of_week',
  'log_item_exposures',
  'treatment',
  'category',
  'subcategory'])

The feature lists define what information is allowed into the adjustment models. These are pre-treatment or contextual variables intended to reduce confounding without using the outcome itself as an input.

Reusable Model Pipelines

We use logistic regression for both nuisance models in this notebook. This keeps the model transparent and quick to fit. The goal is not to maximize predictive performance yet; the goal is to produce stable, explainable segment-level doubly robust estimates.

Build Preprocessing And Logistic Regression Pipelines

This cell defines helper functions for model creation. Numeric columns are imputed and scaled. Categorical columns are imputed and one-hot encoded, with rare levels grouped. The helper functions return fresh sklearn pipelines so cross-fitting can train separate models per fold.

def make_preprocessor(numeric_cols, categorical_cols):
    numeric_pipeline = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )
    categorical_pipeline = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            (
                "onehot",
                OneHotEncoder(
                    handle_unknown="infrequent_if_exist",
                    min_frequency=50,
                    sparse_output=True,
                ),
            ),
        ]
    )
    return ColumnTransformer(
        transformers=[
            ("num", numeric_pipeline, numeric_cols),
            ("cat", categorical_pipeline, categorical_cols),
        ]
    )


def make_logistic_pipeline(numeric_cols, categorical_cols):
    return Pipeline(
        steps=[
            ("preprocess", make_preprocessor(numeric_cols, categorical_cols)),
            (
                "model",
                LogisticRegression(
                    max_iter=500,
                    solver="lbfgs",
                    n_jobs=-1,
                    random_state=RANDOM_STATE,
                ),
            ),
        ]
    )


base_propensity_model = make_logistic_pipeline(numeric_features, categorical_features)
base_outcome_model = make_logistic_pipeline(outcome_numeric_features, categorical_features)

This cell creates reusable modeling machinery rather than a final result. The value is consistency: the same preprocessing and helper functions can be applied across folds, estimators, and sensitivity checks.

Cross-Fitted AIPW Scores

To estimate segment-level effects, we first need one AIPW score per row. The mean of AIPW scores within a segment estimates the treatment effect for that segment.

We use cross-fitting so each row’s nuisance predictions are generated by models that did not train on that row. This reduces overfitting bias and makes the per-row scores more defensible.
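The idea of out-of-fold prediction can be sketched on toy data before running the real models. This is a minimal illustration, not the notebook's procedure: it swaps in a least-squares linear model and a random fold assignment, and the names cross_fit_predictions, X_toy, and y_toy are hypothetical.

```python
import numpy as np


def cross_fit_predictions(X, y, n_folds=3, seed=0):
    """Return out-of-fold predictions from a least-squares linear model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Hypothetical fold assignment: shuffle row positions, then deal them into folds.
    fold_id = rng.permutation(n) % n_folds
    X1 = np.column_stack([np.ones(n), X])  # add an intercept column
    preds = np.full(n, np.nan)
    for k in range(n_folds):
        held_out = fold_id == k
        # Fit only on rows outside fold k.
        coef, *_ = np.linalg.lstsq(X1[~held_out], y[~held_out], rcond=None)
        # Predict only for rows inside fold k.
        preds[held_out] = X1[held_out] @ coef
    return preds, fold_id


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 2))
y_toy = 0.5 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)
oof_pred, fold_id = cross_fit_predictions(X_toy, y_toy)
```

Every row ends up with exactly one prediction, and that prediction comes from a model that never saw the row. The notebook follows the same pattern with stratified folds and logistic regression pipelines.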

Train Nuisance Models And Predict Held-Out Rows

This cell runs 3-fold cross-fitting. For each fold, it fits a propensity model and an outcome model on the training folds, then predicts on the held-out fold. For the outcome model, each held-out row is scored twice: once as if treatment = 1 and once as if treatment = 0. These are the counterfactual predictions mu1_hat and mu0_hat.

N_FOLDS = 3
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_STATE)

e_hat = np.zeros(len(model_df))
mu1_hat = np.zeros(len(model_df))
mu0_hat = np.zeros(len(model_df))

propensity_metrics = []
outcome_metrics = []

for fold, (train_idx, valid_idx) in enumerate(skf.split(X, t), start=1):
    train_df = model_df.iloc[train_idx]
    valid_df = model_df.iloc[valid_idx]

    propensity_model = clone(base_propensity_model)
    propensity_model.fit(train_df[propensity_features], train_df["treatment"])
    e_valid = propensity_model.predict_proba(valid_df[propensity_features])[:, 1]
    e_hat[valid_idx] = e_valid

    outcome_model = clone(base_outcome_model)
    outcome_model.fit(train_df[outcome_features], train_df["outcome"])

    valid_actual = valid_df[outcome_features]
    y_valid_hat = outcome_model.predict_proba(valid_actual)[:, 1]

    valid_treated = valid_df[propensity_features].copy()
    valid_treated["treatment"] = 1
    valid_treated = valid_treated[outcome_features]

    valid_control = valid_df[propensity_features].copy()
    valid_control["treatment"] = 0
    valid_control = valid_control[outcome_features]

    mu1_hat[valid_idx] = outcome_model.predict_proba(valid_treated)[:, 1]
    mu0_hat[valid_idx] = outcome_model.predict_proba(valid_control)[:, 1]

    propensity_metrics.append(
        {
            "fold": fold,
            "roc_auc": roc_auc_score(valid_df["treatment"], e_valid),
            "average_precision": average_precision_score(valid_df["treatment"], e_valid),
            "brier_score": brier_score_loss(valid_df["treatment"], e_valid),
        }
    )
    outcome_metrics.append(
        {
            "fold": fold,
            "roc_auc": roc_auc_score(valid_df["outcome"], y_valid_hat),
            "average_precision": average_precision_score(valid_df["outcome"], y_valid_hat),
            "brier_score": brier_score_loss(valid_df["outcome"], y_valid_hat),
        }
    )

model_df["e_hat"] = e_hat
model_df["mu1_hat"] = mu1_hat
model_df["mu0_hat"] = mu0_hat
model_df["mu_diff_hat"] = model_df["mu1_hat"] - model_df["mu0_hat"]

pd.DataFrame(propensity_metrics)

	fold	roc_auc	average_precision	brier_score
0	1	0.7769	0.3407	0.0660
1	2	0.7622	0.3058	0.0668
2	3	0.7661	0.3205	0.0665

The propensity model separates top-3 rows from lower-ranked rows moderately well, with held-out ROC AUC between 0.76 and 0.78 in every fold. Treatment is therefore predictable from the covariates, which is why adjustment is needed, but not so predictable that treated and control rows are fully separated. These are nuisance-model diagnostics, not causal results.

Inspect Outcome Model Metrics

This cell shows the held-out predictive performance of the outcome model across folds. Clicks are sparse, so average precision is often more informative than accuracy. These metrics are diagnostics for nuisance models, not the final causal answer.

pd.DataFrame(outcome_metrics)

	fold	roc_auc	average_precision	brier_score
0	1	0.7119	0.1232	0.0363
1	2	0.6934	0.1111	0.0371
2	3	0.7098	0.1132	0.0368

The nuisance-model metrics show how well the supporting prediction models perform. They are not the causal answer, but weak nuisance models can make IPW, DR, and policy estimates less reliable.
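One way to put the Brier scores in context is to compare them with a constant prediction equal to the base rate, whose Brier score is p(1 - p). The sketch below uses the sample-wide rates reported earlier (0.0396 for clicks, 0.0801 for treatment) rather than per-fold rates, so treat the comparison as approximate. The helper name constant_brier is hypothetical.

```python
def constant_brier(base_rate):
    """Brier score of always predicting base_rate for a binary target with that mean."""
    # mean((y - p)^2) = p * (1 - p)^2 + (1 - p) * p^2 = p * (1 - p)
    return base_rate * (1 - base_rate)


click_baseline = constant_brier(0.0396)      # sample click rate
treatment_baseline = constant_brier(0.0801)  # sample top-3 rate

# Accuracy of always predicting "no click", which shows why accuracy is uninformative here.
always_negative_accuracy = 1 - 0.0396
```

The click baseline is about 0.038, so outcome-model Brier scores of 0.0363 to 0.0371 are only a modest improvement. The treatment baseline is about 0.074, against propensity-model scores of 0.0660 to 0.0668.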

Compute Per-Row AIPW Scores

This cell applies the AIPW formula to every row. The mean of aipw_score across all rows is the global doubly robust estimate. The mean within a segment is the segment-specific doubly robust estimate. Propensity scores are clipped away from 0 and 1 for numerical stability.

EPS = 0.01

e = model_df["e_hat"].clip(EPS, 1 - EPS).to_numpy()
mu1 = model_df["mu1_hat"].to_numpy()
mu0 = model_df["mu0_hat"].to_numpy()
t_np = model_df["treatment"].to_numpy()
y_np = model_df["outcome"].to_numpy()

model_df["aipw_score"] = (mu1 - mu0) + t_np * (y_np - mu1) / e - (1 - t_np) * (y_np - mu0) / (1 - e)

global_dr_lift = model_df["aipw_score"].mean()
global_dr_se = model_df["aipw_score"].std(ddof=1) / np.sqrt(len(model_df))

pd.Series(
    {
        "global_dr_lift": global_dr_lift,
        "standard_error": global_dr_se,
        "ci_95_lower": global_dr_lift - 1.96 * global_dr_se,
        "ci_95_upper": global_dr_lift + 1.96 * global_dr_se,
    }
)

global_dr_lift   0.0111
standard_error   0.0025
ci_95_lower      0.0062
ci_95_upper      0.0161
dtype: float64

The AIPW score combines outcome-model predictions with propensity-weighted residual corrections. This is the key doubly robust object: it can remain consistent if either the propensity model or the outcome model is correctly specified.
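A worked example with made-up numbers shows how the two correction terms behave. The function below restates the formula from the cell above for a single row. The input values are illustrative, not taken from the data.

```python
def aipw_score(y, t, e, mu1, mu0):
    """AIPW score for one row, matching the formula used for model_df['aipw_score']."""
    outcome_term = mu1 - mu0                            # outcome-model contrast
    treated_correction = t * (y - mu1) / e              # active only for treated rows
    control_correction = (1 - t) * (y - mu0) / (1 - e)  # active only for control rows
    return outcome_term + treated_correction - control_correction


# Treated row that was clicked: the residual (1 - 0.5) is divided by e = 0.25.
treated_clicked = aipw_score(y=1, t=1, e=0.25, mu1=0.5, mu0=0.25)

# Control row that was not clicked: the residual (0 - 0.25) is divided by 1 - e = 0.75.
control_no_click = aipw_score(y=0, t=0, e=0.25, mu1=0.5, mu0=0.25)
```

The treated row scores 0.25 + 0.5 / 0.25 = 2.25, and the control row scores 0.25 + 0.25 / 0.75, or about 0.583. Individual scores can fall far outside the range of a probability difference. Only their average is interpreted as a lift.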

Check Nuisance Prediction Distributions

This cell summarizes propensity scores, outcome-model predictions, and AIPW scores. It helps catch instability before segmenting the estimate. Extreme AIPW score tails can make segment-level estimates noisy, especially for small segments.

model_df[["e_hat", "mu1_hat", "mu0_hat", "mu_diff_hat", "aipw_score"]].describe(
    percentiles=[0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99]
)

	e_hat	mu1_hat	mu0_hat	mu_diff_hat	aipw_score
count	150000.0000	150000.0000	150000.0000	150000.0000	150000.0000
mean	0.0801	0.0644	0.0360	0.0283	0.0111
std	0.0728	0.0442	0.0258	0.0186	0.9733
min	0.0001	0.0008	0.0004	0.0003	-8.2176
1%	0.0005	0.0044	0.0023	0.0020	-1.2288
5%	0.0033	0.0113	0.0061	0.0052	-0.8844
10%	0.0077	0.0173	0.0093	0.0080	-0.3209
50%	0.0575	0.0547	0.0300	0.0246	0.0473
90%	0.1892	0.1280	0.0727	0.0551	0.1285
95%	0.2280	0.1535	0.0884	0.0652	0.1622
99%	0.2900	0.1971	0.1157	0.0820	0.2637
max	0.3922	0.3898	0.2552	0.1408	99.2251

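The nuisance prediction summaries check the range of predicted propensities and potential outcomes. Two things stand out. The 10th percentile of e_hat (0.0077) is below EPS = 0.01, so more than 10% of rows have their propensity raised to the clipping floor. The aipw_score column also has heavy tails (minimum -8.2176, maximum 99.2251, standard deviation 0.9733 against a mean of 0.0111), so a few rows can dominate the average of a small segment.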
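To see how much work the clipping floor is doing, a small helper can report the share of rows below the floor and the largest inverse weights that clipping allows. This is a sketch on toy propensities. The helper name clipping_summary and the toy values are hypothetical, and EPS mirrors the notebook's setting.

```python
import numpy as np

EPS = 0.01  # same clipping floor as the notebook


def clipping_summary(e_hat, eps=EPS):
    """Summarize how propensity clipping affects inverse-propensity weights."""
    e_hat = np.asarray(e_hat, dtype=float)
    clipped = np.clip(e_hat, eps, 1 - eps)
    return {
        # Fraction of rows whose propensity is raised to the floor.
        "share_below_floor": float(np.mean(e_hat < eps)),
        # Largest possible 1 / e for a treated row after clipping.
        "treated_weight_cap": 1.0 / eps,
        # Largest observed 1 / (1 - e) for a control row after clipping.
        "largest_control_weight": float(1.0 / (1.0 - clipped.max())),
    }


toy = clipping_summary([0.0005, 0.004, 0.02, 0.08, 0.30])
```

With EPS = 0.01 the treated weight cap is 100, so a clicked treated row at the floor contributes a score just under 100. The observed maximum aipw_score of 99.2251 is consistent with such a row. Running the helper on model_df["e_hat"] would give the exact clipped share.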

Create Product-Relevant Segments

The segment definitions should reflect product questions. We create segments that map to common recommendation-system concerns:

Content category and subcategory.
User history depth.
Candidate set size.
Item exposure/popularity.
Time of day.

These are not the only possible segments, but they are interpretable and easy to explain in a portfolio writeup.

Build Segment Columns

This cell adds bucketed segment columns to model_df. Buckets make continuous variables easier to interpret. For item exposure, we use quartiles based on ranked exposure counts so each bucket has a similar number of rows even when exposure counts are skewed.

model_df["history_bucket"] = pd.cut(
    model_df["history_len"],
    bins=[-1, 0, 10, 30, 100, np.inf],
    labels=["0", "1-10", "11-30", "31-100", "101+"],
)

model_df["candidate_set_bucket"] = pd.cut(
    model_df["candidate_set_size"],
    bins=[0, 10, 25, 50, 100, np.inf],
    labels=["1-10", "11-25", "26-50", "51-100", "101+"],
    include_lowest=True,
)

model_df["item_exposure_quartile"] = pd.qcut(
    model_df["item_exposures"].rank(method="first"),
    q=4,
    labels=["Q1 lowest", "Q2", "Q3", "Q4 highest"],
)

model_df["time_of_day"] = pd.cut(
    model_df["hour"],
    bins=[-1, 5, 11, 16, 20, 23],
    labels=["overnight", "morning", "afternoon", "evening", "late_evening"],
)

model_df[["history_bucket", "candidate_set_bucket", "item_exposure_quartile", "time_of_day"]].head()

	history_bucket	candidate_set_bucket	item_exposure_quartile	time_of_day
0	31-100	26-50	Q2	afternoon
1	31-100	51-100	Q2	overnight
2	31-100	51-100	Q2	afternoon
3	31-100	51-100	Q3	morning
4	31-100	51-100	Q2	morning

The segment columns translate raw covariates into product-readable groups. This prepares the analysis for heterogeneity and policy simulation, where segment-level effects are easier to act on than row-level scores.
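The rank-then-qcut step used for item_exposure_quartile deserves a short illustration. When a count column has many ties, pd.qcut on the raw values can produce duplicate bin edges and raise an error, while ranking with method="first" breaks ties by row order and yields equal-sized buckets. The toy series below is hypothetical.

```python
import pandas as pd

exposures = pd.Series([1, 1, 1, 1, 1, 1, 2, 3])  # toy, heavily tied counts

# Quartile edges of the raw values repeat (1, 1, 1, 1.25, 3), so qcut refuses.
try:
    pd.qcut(exposures, q=4)
    raw_qcut_failed = False
except ValueError:
    raw_qcut_failed = True

# Ranking first gives unique values 1..8, so the quartile edges are distinct.
ranked_buckets = pd.qcut(
    exposures.rank(method="first"),
    q=4,
    labels=["Q1 lowest", "Q2", "Q3", "Q4 highest"],
)
bucket_sizes = ranked_buckets.value_counts().sort_index()
```

The trade-off is that rows with identical exposure counts can land in adjacent quartiles purely because of row order, so the bucket boundaries are not sharp thresholds on item_exposures.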

Segment Effect Estimator

For each segment, we compute:

Number of rows.
Number of treated and control rows.
Naive lift within the segment.
Doubly robust lift within the segment.
Standard error and 95% confidence interval for the segment DR lift.

Small segments can produce unstable estimates, so the helper function filters out segments with too few rows or too few treated/control examples.

Define Segment Summary Functions

This cell defines reusable functions for segment analysis. segment_effects computes a table of estimates for one segment column. plot_segment_effects creates a horizontal confidence-interval plot for a segment table.

def segment_effects(data, segment_col, min_rows=1_000, min_treated=50, min_control=500):
    rows = []
    for segment_value, group in data.groupby(segment_col, observed=True, dropna=False):
        n_rows = len(group)
        treated_rows = int(group["treatment"].sum())
        control_rows = n_rows - treated_rows

        if n_rows < min_rows or treated_rows < min_treated or control_rows < min_control:
            continue

        treated_ctr = group.loc[group["treatment"] == 1, "outcome"].mean()
        control_ctr = group.loc[group["treatment"] == 0, "outcome"].mean()
        naive_lift = treated_ctr - control_ctr

        scores = group["aipw_score"].to_numpy()
        dr_lift = scores.mean()
        standard_error = scores.std(ddof=1) / np.sqrt(n_rows)

        rows.append(
            {
                "segment_col": segment_col,
                "segment": str(segment_value),
                "rows": n_rows,
                "treated_rows": treated_rows,
                "control_rows": control_rows,
                "treated_ctr": treated_ctr,
                "control_ctr": control_ctr,
                "naive_lift": naive_lift,
                "dr_lift": dr_lift,
                "standard_error": standard_error,
                "ci_95_lower": dr_lift - 1.96 * standard_error,
                "ci_95_upper": dr_lift + 1.96 * standard_error,
            }
        )

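    result = pd.DataFrame(rows)
    if result.empty:
        # No segment passed the filters; return the empty frame instead of failing on sort.
        return result
    return result.sort_values("dr_lift", ascending=False).reset_index(drop=True)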


def plot_segment_effects(effect_df, title, max_segments=15):
    plot_df = effect_df.head(max_segments).sort_values("dr_lift")
    y = np.arange(len(plot_df))
    lower_error = plot_df["dr_lift"] - plot_df["ci_95_lower"]
    upper_error = plot_df["ci_95_upper"] - plot_df["dr_lift"]

    plt.figure(figsize=(10, max(4, 0.38 * len(plot_df))))
    plt.errorbar(
        x=plot_df["dr_lift"],
        y=y,
        xerr=[lower_error, upper_error],
        fmt="o",
        capsize=3,
    )
    plt.axvline(0, color="black", linewidth=1)
    plt.yticks(y, plot_df["segment"])
    plt.title(title)
    plt.xlabel("Doubly robust lift in click probability")
    plt.ylabel("Segment")
    plt.tight_layout()

This helper defines how segment-level effects will be computed and filtered. Minimum row, treatment, and control counts keep the segment results from being driven by tiny groups.

Effects By Category

Category-level effects answer whether broad content types are more or less position-sensitive. This is a useful product lens because ranking policies often need to balance different content families.

Estimate Category-Level Effects

This cell estimates doubly robust lift separately for each broad category. The table is sorted by dr_lift, so the top rows are the categories with the largest estimated top-3 effect among segments that pass the minimum sample-size filters.

category_effects = segment_effects(model_df, "category", min_rows=1_000, min_treated=50, min_control=500)
category_effects

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	category	weather	2229	282	1947	0.1915	0.0370	0.1545	0.0501	0.0122	0.0261	0.0741
1	category	tv	6458	594	5864	0.1734	0.0425	0.1309	0.0341	0.0088	0.0168	0.0515
2	category	movies	3442	249	3193	0.0683	0.0263	0.0420	0.0309	0.0300	-0.0279	0.0897
3	category	health	7739	442	7297	0.0475	0.0306	0.0170	0.0303	0.0235	-0.0158	0.0764
4	category	music	6949	740	6209	0.1608	0.0480	0.1128	0.0232	0.0159	-0.0079	0.0542
5	category	sports	15075	1323	13752	0.1406	0.0391	0.1015	0.0192	0.0075	0.0046	0.0338
6	category	foodanddrink	9425	508	8917	0.0669	0.0267	0.0402	0.0180	0.0133	-0.0081	0.0441
7	category	news	40583	3685	36898	0.1085	0.0366	0.0719	0.0100	0.0033	0.0036	0.0165
8	category	video	2297	201	2096	0.0896	0.0406	0.0490	0.0070	0.0147	-0.0219	0.0358
9	category	entertainment	9061	674	8387	0.0579	0.0248	0.0331	0.0069	0.0108	-0.0143	0.0281
10	category	travel	8220	482	7738	0.0726	0.0231	0.0495	0.0046	0.0071	-0.0093	0.0185
11	category	finance	14499	1178	13321	0.0823	0.0324	0.0500	-0.0010	0.0050	-0.0108	0.0087
12	category	lifestyle	16973	1264	15709	0.0696	0.0368	0.0328	-0.0047	0.0064	-0.0174	0.0079
13	category	autos	7047	391	6656	0.0332	0.0263	0.0070	-0.0074	0.0095	-0.0261	0.0113

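The category-level table estimates where top-rank exposure appears most or least valuable across content categories. Four of the 14 categories (weather, tv, sports, and news) have 95% intervals entirely above zero; the rest are not distinguishable from zero in this sample. These differences are useful for product strategy, but they should be read with uncertainty and sample-size filters in mind.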

Plot Category-Level Effects

This cell visualizes the category-level doubly robust estimates with 95% confidence intervals. Segments whose intervals are entirely above zero have evidence of positive top-3 lift under the model assumptions. Wide intervals suggest the segment estimate is noisy.

plot_segment_effects(category_effects, "Doubly Robust Top-3 Lift By Category")

Effects By Subcategory

Subcategories are more granular than categories. They can reveal product patterns that broad categories hide. Because subcategories can be sparse, we use stricter filtering and focus on subcategories with enough treated and control rows.

Estimate Subcategory-Level Effects

This cell estimates doubly robust lift for subcategories. The result can be used to identify more specific content areas where rank position appears especially important. These estimates should be interpreted carefully because granular segments have less data.

subcategory_effects = segment_effects(model_df, "subcategory", min_rows=1_500, min_treated=75, min_control=750)
subcategory_effects.head(20)

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	subcategory	tvnews	2051	282	1769	0.2270	0.0543	0.1727	0.0655	0.0161	0.0339	0.0971
1	subcategory	weathertopstories	2229	282	1947	0.1915	0.0370	0.1545	0.0501	0.0122	0.0261	0.0741
2	subcategory	tv-celebrity	2665	214	2451	0.1776	0.0465	0.1311	0.0477	0.0168	0.0148	0.0805
3	subcategory	restaurantsandnews	1692	98	1594	0.0918	0.0270	0.0649	0.0460	0.0505	-0.0531	0.1450
4	subcategory	cma-awards	2034	126	1908	0.0317	0.0257	0.0061	0.0294	0.0494	-0.0674	0.1261
5	subcategory	music-celebrity	2460	342	2118	0.1959	0.0699	0.1260	0.0282	0.0151	-0.0015	0.0579
6	subcategory	traveltripideas	2225	125	2100	0.0800	0.0214	0.0586	0.0278	0.0203	-0.0120	0.0676
7	subcategory	elections-2020-us	2262	249	2013	0.1205	0.0373	0.0832	0.0271	0.0232	-0.0184	0.0726
8	subcategory	football_nfl	6235	674	5561	0.1884	0.0500	0.1384	0.0251	0.0077	0.0099	0.0402
9	subcategory	finance-companies	4255	476	3779	0.1408	0.0466	0.0942	0.0247	0.0134	-0.0014	0.0509
10	subcategory	newscrime	5118	611	4507	0.1489	0.0546	0.0944	0.0220	0.0091	0.0042	0.0397
11	subcategory	nutrition	2057	137	1920	0.0584	0.0234	0.0350	0.0215	0.0228	-0.0232	0.0661
12	subcategory	recipes	1640	92	1548	0.0761	0.0304	0.0457	0.0205	0.0231	-0.0247	0.0657
13	subcategory	newsus	12301	1194	11107	0.1449	0.0426	0.1023	0.0176	0.0056	0.0066	0.0287
14	subcategory	baseball_mlb	1537	107	1430	0.0654	0.0280	0.0374	0.0130	0.0206	-0.0273	0.0533
15	subcategory	shop-holidays	2293	270	2023	0.0852	0.0262	0.0590	0.0118	0.0091	-0.0061	0.0296
16	subcategory	newsgoodnews	2342	139	2203	0.0791	0.0209	0.0583	0.0105	0.0131	-0.0151	0.0361
17	subcategory	entertainment-celebrity	2615	236	2379	0.0678	0.0298	0.0380	0.0056	0.0107	-0.0153	0.0266
18	subcategory	newsworld	8377	719	7658	0.0682	0.0248	0.0433	0.0048	0.0057	-0.0064	0.0161
19	subcategory	voices	1823	97	1726	0.0619	0.0301	0.0317	0.0046	0.0180	-0.0307	0.0398

Plot Top Subcategory Effects

This plot shows the subcategories with the largest estimated doubly robust lift. This is useful for product storytelling, but it is also where multiple-comparisons risk appears. Treat these as hypotheses for deeper analysis rather than final policy recommendations.

plot_segment_effects(subcategory_effects, "Largest Doubly Robust Top-3 Lift By Subcategory", max_segments=15)

The subcategory results drill into more granular content groups. Because finer segments have smaller sample sizes, these estimates are best used to generate hypotheses rather than final policy rules on their own.
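One simple way to account for the multiple-comparisons risk is a Bonferroni adjustment, which widens each interval according to how many segments are examined. This is one option among several and is not part of the notebook's pipeline. The count of 20 comparisons is an assumption based on the 20 rows displayed above, and the helper names are hypothetical.

```python
from statistics import NormalDist


def bonferroni_z(n_comparisons, alpha=0.05):
    """Two-sided normal critical value after splitting alpha across comparisons."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))


def adjusted_interval(estimate, standard_error, n_comparisons, alpha=0.05):
    """Confidence interval using the Bonferroni-adjusted critical value."""
    z = bonferroni_z(n_comparisons, alpha)
    return estimate - z * standard_error, estimate + z * standard_error


z_single = bonferroni_z(1)   # the usual 1.96 for a single comparison
z_twenty = bonferroni_z(20)  # wider critical value for 20 comparisons

# tvnews row from the table above: dr_lift = 0.0655, standard_error = 0.0161.
tvnews_lower, tvnews_upper = adjusted_interval(0.0655, 0.0161, n_comparisons=20)
```

With 20 comparisons the critical value rises to a little over 3, so segments whose unadjusted intervals barely exclude zero would no longer do so. The tvnews interval stays above zero under this adjustment.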

Effects By User History Depth

User history depth is a proxy for how much prior behavior the system has about the user. A rank-position effect may differ between cold-start or low-history users and users with long histories.

Estimate Effects By History Bucket

This cell estimates doubly robust top-3 lift for each user-history bucket. If low-history users show stronger effects, that could mean position matters more when personalization has less information. If high-history users show stronger effects, that could mean ranked personalization is especially effective for known users.

history_effects = segment_effects(model_df, "history_bucket", min_rows=1_000, min_treated=50, min_control=500)
history_effects

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	history_bucket	1-10	39649	3573	36076	0.1184	0.0311	0.0873	0.0168	0.0040	0.0088	0.0247
1	history_bucket	101+	10449	725	9724	0.0966	0.0467	0.0499	0.0103	0.0132	-0.0155	0.0361
2	history_bucket	11-30	49100	3896	45204	0.1001	0.0325	0.0676	0.0102	0.0039	0.0025	0.0179
3	history_bucket	31-100	47750	3606	44144	0.0885	0.0356	0.0528	0.0080	0.0051	-0.0020	0.0180
4	history_bucket	0	3052	213	2839	0.1033	0.0313	0.0719	0.0057	0.0089	-0.0117	0.0231

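The history-bucket results show whether rank exposure matters differently for users with shallow versus deep prior histories. Point estimates range from 0.0057 (no history) to 0.0168 (1-10 items), and only the 1-10 and 11-30 buckets have intervals that exclude zero. The intervals overlap heavily, so this table alone does not establish that position sensitivity differs by history depth.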
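To judge whether two bucket estimates differ by more than noise, a rough check is a z-statistic for the difference in lifts. The sketch below treats the two segment estimates as independent because they use disjoint rows. That is an approximation, since both depend on the same nuisance models. The helper name lift_difference is hypothetical.

```python
from math import sqrt


def lift_difference(lift_a, se_a, lift_b, se_b):
    """Difference between two segment lifts with an approximate z-statistic."""
    diff = lift_a - lift_b
    se_diff = sqrt(se_a**2 + se_b**2)  # assumes the two estimates are independent
    return diff, se_diff, diff / se_diff


# Table values: 1-10 bucket (0.0168, SE 0.0040) versus 31-100 bucket (0.0080, SE 0.0051).
diff, se_diff, z = lift_difference(0.0168, 0.0040, 0.0080, 0.0051)
```

Here the difference is 0.0088 with a z-statistic of roughly 1.36, below the usual 1.96 threshold, so these two buckets are not clearly different at the 95% level.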

Plot Effects By History Bucket

This cell plots the user-history bucket estimates. The confidence intervals help distinguish meaningful differences from noise. The chart is a product-facing way to discuss whether rank position is more important for users with sparse or rich histories.

plot_segment_effects(history_effects, "Doubly Robust Top-3 Lift By User History Depth")

Effects By Candidate Set Size

Candidate set size describes how many items appeared in the same impression. Longer lists may create more attention decay, so top-3 placement could be more valuable when users are shown many options.

Estimate Effects By Candidate Set Bucket

This cell estimates doubly robust lift for each candidate-set-size bucket. The result helps answer whether position bias is stronger in longer slates than shorter slates.

candidate_effects = segment_effects(model_df, "candidate_set_bucket", min_rows=1_000, min_treated=50, min_control=500)
candidate_effects

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	candidate_set_bucket	1-10	5865	3068	2797	0.2627	0.1376	0.1251	0.2275	0.0161	0.1959	0.2591
1	candidate_set_bucket	26-50	35116	3030	32086	0.0462	0.0430	0.0032	0.0142	0.0030	0.0083	0.0202
2	candidate_set_bucket	11-25	16951	2895	14056	0.0739	0.0706	0.0033	0.0067	0.0050	-0.0030	0.0164
3	candidate_set_bucket	51-100	48463	2052	46411	0.0249	0.0265	-0.0017	0.0022	0.0033	-0.0043	0.0086
4	candidate_set_bucket	101+	43605	968	42637	0.0134	0.0169	-0.0035	-0.0088	0.0068	-0.0222	0.0046

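The candidate-set results check whether ranking lift changes when users face larger or smaller recommendation slates. In this sample the pattern runs against the attention-decay hypothesis: the estimated lift is largest in the smallest slates (0.2275 for 1-10 items), 26-50 is the only other bucket whose interval excludes zero, and the estimates for slates of 51 or more items are close to zero. The 1-10 estimate needs caution, because more than half of its rows are treated and its DR lift is well above its naive lift (0.1251).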
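Part of the difference across buckets is structural: in a short slate, a large share of positions are top-3 by construction. The sketch below computes that share under the assumption that each impression contributes one row per displayed position. The helper name top3_share is hypothetical.

```python
def top3_share(slate_size):
    """Share of a slate's positions that fall in the top 3."""
    if slate_size < 1:
        raise ValueError("slate_size must be at least 1")
    return min(3, slate_size) / slate_size


# Structural treated share for a few illustrative slate sizes.
shares = {n: top3_share(n) for n in (3, 5, 10, 25, 100)}
```

This lines up with the table: the 1-10 bucket has 3,068 treated rows out of 5,865 (about 52%), while the 101+ bucket has 968 out of 43,605 (about 2%). Propensities therefore vary sharply with candidate_set_size, which makes overlap within each bucket worth checking before comparing buckets.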

Plot Effects By Candidate Set Bucket

This cell visualizes the candidate-set bucket estimates. If larger slates have higher top-3 lift, that supports the idea that visibility matters more when there are many competing items.

plot_segment_effects(candidate_effects, "Doubly Robust Top-3 Lift By Candidate Set Size")

Effects By Item Exposure Level

Item exposure is a simple popularity/visibility proxy. The top-3 effect may differ between rarely exposed items and frequently exposed items. For example, highly exposed items may already have strong baseline demand, while lower-exposure items may depend more on prominent placement.

Estimate Effects By Exposure Quartile

This cell estimates doubly robust lift across item-exposure quartiles. Because exposure counts are skewed, quartiles are useful: each bucket has a roughly similar number of displayed rows, which keeps estimates more stable.

exposure_effects = segment_effects(model_df, "item_exposure_quartile", min_rows=1_000, min_treated=50, min_control=500)
exposure_effects

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	item_exposure_quartile	Q4 highest	37500	4525	32975	0.1421	0.0469	0.0952	0.0190	0.0032	0.0128	0.0253
1	item_exposure_quartile	Q3	37500	3248	34252	0.0927	0.0343	0.0583	0.0121	0.0052	0.0018	0.0223
2	item_exposure_quartile	Q2	37500	2327	35173	0.0688	0.0275	0.0412	0.0080	0.0050	-0.0019	0.0179
3	item_exposure_quartile	Q1 lowest	37500	1913	35587	0.0627	0.0287	0.0341	0.0055	0.0062	-0.0066	0.0176

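The exposure-quartile results compare items with different baseline visibility. Point estimates rise monotonically with exposure, from 0.0055 in Q1 to 0.0190 in Q4, and only Q3 and Q4 have intervals that exclude zero. Prominent placement therefore appears to help already-visible items at least as much as rarely shown ones, although adjacent intervals overlap.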

Plot Effects By Exposure Quartile

This cell plots top-3 lift by exposure quartile. The result can support a product interpretation about whether prominent ranking is more useful for already-popular items or for items that otherwise receive less exposure.

plot_segment_effects(exposure_effects, "Doubly Robust Top-3 Lift By Item Exposure Quartile")

Effects By Time Of Day

User attention and browsing behavior may change throughout the day. Segmenting by time of day can reveal whether top placement matters more during certain browsing contexts.

Estimate Effects By Time Of Day

This cell estimates doubly robust top-3 lift for each time-of-day bucket. Time segments are broad enough to remain interpretable without being too sparse.

time_effects = segment_effects(model_df, "time_of_day", min_rows=1_000, min_treated=50, min_control=500)
time_effects

	segment_col	segment	rows	treated_rows	control_rows	treated_ctr	control_ctr	naive_lift	dr_lift	standard_error	ci_95_lower	ci_95_upper
0	time_of_day	afternoon	41771	3354	38417	0.0987	0.0327	0.0660	0.0145	0.0049	0.0049	0.0242
1	time_of_day	overnight	21507	1552	19955	0.1108	0.0318	0.0790	0.0130	0.0061	0.0010	0.0250
2	time_of_day	morning	65442	5396	60046	0.1042	0.0361	0.0680	0.0108	0.0039	0.0032	0.0184
3	time_of_day	late_evening	4993	452	4541	0.1173	0.0381	0.0792	0.0100	0.0116	-0.0128	0.0329
4	time_of_day	evening	16287	1259	15028	0.0842	0.0317	0.0525	0.0018	0.0075	-0.0129	0.0165

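The time-of-day results check whether the ranking effect changes across usage contexts. Point estimates range from 0.0018 (evening) to 0.0145 (afternoon). The afternoon, overnight, and morning intervals exclude zero, while the late_evening and evening intervals do not; late_evening has only 452 treated rows and the widest interval. All five intervals overlap, so the table does not establish that the effect differs by time of day.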

Plot Effects By Time Of Day

This cell visualizes time-of-day effect estimates. Differences here can become hypotheses about user attention patterns, though time-of-day effects should be interpreted cautiously because they can correlate with content mix and user mix.

plot_segment_effects(time_effects, "Doubly Robust Top-3 Lift By Time Of Day")
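Because time of day can correlate with content mix, a quick cross-tabulation helps show whether the buckets are comparable before reading much into their differences. The sketch below uses toy rows; the `time_of_day` and `category` column names follow the segment names used in this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Toy rows standing in for the modeling frame; values are illustrative only.
toy = pd.DataFrame(
    {
        "time_of_day": ["morning", "morning", "morning", "evening", "evening", "evening"],
        "category": ["news", "news", "sports", "sports", "sports", "tv"],
    }
)

# Row-normalized shares: each row sums to 1 and shows the category mix
# served within that time-of-day bucket.
mix = pd.crosstab(toy["time_of_day"], toy["category"], normalize="index")
print(mix.round(2))
```

If the category shares differ sharply between buckets, a time-of-day difference in lift may partly reflect which content is shown rather than user attention.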

Product Summary Table

The previous sections looked at each segmentation dimension separately. This section combines all segment tables into one product summary so we can quickly see the strongest and weakest estimated effects across the notebook.

Combine Segment Results

This cell stacks all segment-effect tables and sorts them by estimated doubly robust lift. The top rows show segments where top-3 placement appears most valuable. The bottom rows show segments where estimated lift is smallest. This table is a starting point for product interpretation, not a final ranking policy.

all_effects = pd.concat(
    [
        category_effects,
        subcategory_effects,
        history_effects,
        candidate_effects,
        exposure_effects,
        time_effects,
    ],
    ignore_index=True,
)

summary_cols = [
    "segment_col",
    "segment",
    "rows",
    "treated_rows",
    "control_rows",
    "naive_lift",
    "dr_lift",
    "ci_95_lower",
    "ci_95_upper",
]

all_effects[summary_cols].sort_values("dr_lift", ascending=False).head(20)

	segment_col	segment	rows	treated_rows	control_rows	naive_lift	dr_lift	ci_95_lower	ci_95_upper
50	candidate_set_bucket	1-10	5865	3068	2797	0.1251	0.2275	0.1959	0.2591
14	subcategory	tvnews	2051	282	1769	0.1727	0.0655	0.0339	0.0971
15	subcategory	weathertopstories	2229	282	1947	0.1545	0.0501	0.0261	0.0741
0	category	weather	2229	282	1947	0.1545	0.0501	0.0261	0.0741
16	subcategory	tv-celebrity	2665	214	2451	0.1311	0.0477	0.0148	0.0805
17	subcategory	restaurantsandnews	1692	98	1594	0.0649	0.0460	-0.0531	0.1450
1	category	tv	6458	594	5864	0.1309	0.0341	0.0168	0.0515
2	category	movies	3442	249	3193	0.0420	0.0309	-0.0279	0.0897
3	category	health	7739	442	7297	0.0170	0.0303	-0.0158	0.0764
18	subcategory	cma-awards	2034	126	1908	0.0061	0.0294	-0.0674	0.1261
19	subcategory	music-celebrity	2460	342	2118	0.1260	0.0282	-0.0015	0.0579
20	subcategory	traveltripideas	2225	125	2100	0.0586	0.0278	-0.0120	0.0676
21	subcategory	elections-2020-us	2262	249	2013	0.0832	0.0271	-0.0184	0.0726
22	subcategory	football_nfl	6235	674	5561	0.1384	0.0251	0.0099	0.0402
23	subcategory	finance-companies	4255	476	3779	0.0942	0.0247	-0.0014	0.0509
4	category	music	6949	740	6209	0.1128	0.0232	-0.0079	0.0542
24	subcategory	newscrime	5118	611	4507	0.0944	0.0220	0.0042	0.0397
25	subcategory	nutrition	2057	137	1920	0.0350	0.0215	-0.0232	0.0661
26	subcategory	recipes	1640	92	1548	0.0457	0.0205	-0.0247	0.0657
5	category	sports	15075	1323	13752	0.1015	0.0192	0.0046	0.0338

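The combined segment table puts every dimension on one scale, which makes it harder to cherry-pick from separate outputs. Three cautions apply when reading it. The 1-10 candidate-set bucket (0.2275) is an outlier: more than half of its rows are treated and its adjusted lift exceeds its naive lift, so it deserves a separate overlap check before it drives any conclusion. Category and subcategory rows are not independent: weather and weathertopstories have identical row counts and estimates, so they are the same rows counted twice. Several highly ranked segments, such as restaurantsandnews and cma-awards, have intervals that span zero.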
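Sorting by point estimate alone can push noisy segments to the top. One hedged alternative is to keep only segments whose interval excludes zero and rank them by the conservative end of the interval. The sketch below applies that idea to a few rows copied from the table above; the `ci_width` and `ci_excludes_zero` columns are illustrative additions, not columns produced by the notebook.

```python
import pandas as pd

# A handful of rows copied from the summary table above.
top = pd.DataFrame(
    {
        "segment_col": ["candidate_set_bucket", "subcategory", "subcategory", "category", "category"],
        "segment": ["1-10", "tvnews", "restaurantsandnews", "tv", "movies"],
        "dr_lift": [0.2275, 0.0655, 0.0460, 0.0341, 0.0309],
        "ci_95_lower": [0.1959, 0.0339, -0.0531, 0.0168, -0.0279],
        "ci_95_upper": [0.2591, 0.0971, 0.1450, 0.0515, 0.0897],
    }
)

# Interval width is a rough precision measure; wide intervals mean weak support.
top["ci_width"] = top["ci_95_upper"] - top["ci_95_lower"]
top["ci_excludes_zero"] = (top["ci_95_lower"] > 0) | (top["ci_95_upper"] < 0)

# Keep supported segments and rank by the lower bound rather than the point estimate.
supported = top[top["ci_excludes_zero"]].sort_values("ci_95_lower", ascending=False)
print(supported[["segment_col", "segment", "dr_lift", "ci_95_lower", "ci_width"]])
```

Ranking by the lower bound is a judgment call: it favors well-supported segments over large but uncertain ones, which suits a first pass at product prioritization.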

Inspect Segments With The Smallest Estimated Lift

This cell shows the lowest estimated segment effects. These segments may be less position-sensitive, too noisy, or poorly supported by the data. They are useful because product decisions often need to know where top placement is less incremental, not only where it helps most.

all_effects[summary_cols].sort_values("dr_lift", ascending=True).head(20)

	segment_col	segment	rows	treated_rows	control_rows	naive_lift	dr_lift	ci_95_lower	ci_95_upper
44	subcategory	autosenthusiasts	1693	84	1609	-0.0192	-0.0297	-0.0468	-0.0125
43	subcategory	markets	3302	213	3089	0.0104	-0.0148	-0.0287	-0.0010
42	subcategory	lifestyleroyals	3718	272	3446	0.0664	-0.0092	-0.0277	0.0093
54	candidate_set_bucket	101+	43605	968	42637	-0.0035	-0.0088	-0.0222	0.0046
41	subcategory	celebrity	4296	310	3986	0.0216	-0.0085	-0.0209	0.0039
13	category	autos	7047	391	6656	0.0070	-0.0074	-0.0261	0.0113
40	subcategory	football_ncaa	1884	126	1758	0.0424	-0.0064	-0.0373	0.0245
39	subcategory	foodnews	3205	163	3042	0.0286	-0.0055	-0.0254	0.0145
12	category	lifestyle	16973	1264	15709	0.0328	-0.0047	-0.0174	0.0079
38	subcategory	finance-real-estate	2408	138	2270	0.0208	-0.0044	-0.0264	0.0176
37	subcategory	autosnews	1963	127	1836	0.0216	-0.0038	-0.0262	0.0186
36	subcategory	lifestylebuzz	3636	276	3360	0.0020	-0.0032	-0.0469	0.0405
11	category	finance	14499	1178	13321	0.0500	-0.0010	-0.0108	0.0087
63	time_of_day	evening	16287	1259	15028	0.0525	0.0018	-0.0129	0.0165
53	candidate_set_bucket	51-100	48463	2052	46411	-0.0017	0.0022	-0.0043	0.0086
35	subcategory	travelnews	3079	204	2875	0.0744	0.0024	-0.0165	0.0214
34	subcategory	newspolitics	6691	474	6217	0.0366	0.0043	-0.0144	0.0229
33	subcategory	voices	1823	97	1726	0.0317	0.0046	-0.0307	0.0398
10	category	travel	8220	482	7738	0.0495	0.0046	-0.0093	0.0185
32	subcategory	newsworld	8377	719	7658	0.0433	0.0048	-0.0064	0.0161

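The low-lift segments are as important as the high-lift segments because they warn where aggressive promotion may not help. Most of these estimates are statistically indistinguishable from zero. Only autosenthusiasts and markets have intervals entirely below zero, and autosenthusiasts rests on just 84 treated rows. The safer reading is that there is little evidence of incremental benefit in these segments, not that top placement reduces clicks. They are natural candidates for caution in policy simulation.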

Compare Naive Lift And Doubly Robust Lift By Segment

This cell plots naive lift against doubly robust lift for all segments. Points far from the diagonal are segments where causal adjustment changed the story substantially. This is a useful diagnostic for explaining why causal adjustment matters.

plt.figure(figsize=(7, 6))
sns.scatterplot(data=all_effects, x="naive_lift", y="dr_lift", hue="segment_col", s=70)
lims = [
    min(all_effects["naive_lift"].min(), all_effects["dr_lift"].min()),
    max(all_effects["naive_lift"].max(), all_effects["dr_lift"].max()),
]
plt.plot(lims, lims, color="black", linestyle="--", linewidth=1)
plt.axhline(0, color="gray", linewidth=1)
plt.axvline(0, color="gray", linewidth=1)
plt.title("Naive Versus Doubly Robust Segment Lift")
plt.xlabel("Naive lift")
plt.ylabel("Doubly robust lift")
plt.tight_layout()

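Naive lift is the raw click-rate gap between treated and control rows within a segment. Comparing it with the doubly robust estimate shows how much of that gap survives adjustment. In the segment tables above, the adjusted lift is usually much smaller than the naive lift (for example, 0.0190 versus 0.0952 for the highest exposure quartile), so most points should fall below the diagonal. The 1-10 candidate-set bucket is a notable exception, with an adjusted lift above its naive lift.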
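The distance from the diagonal can also be tabulated, which makes it easier to name the segments where adjustment changed the story. The sketch below uses a few rows copied from the tables above; `adjustment_gap` and `sign_flip` are illustrative column names, not part of the notebook.

```python
import pandas as pd

# Rows copied from the segment tables above.
rows = pd.DataFrame(
    {
        "segment": ["Q4 highest", "1-10", "health", "travelnews", "lifestyleroyals"],
        "naive_lift": [0.0952, 0.1251, 0.0170, 0.0744, 0.0664],
        "dr_lift": [0.0190, 0.2275, 0.0303, 0.0024, -0.0092],
    }
)

# Signed vertical distance from the diagonal in the scatterplot.
rows["adjustment_gap"] = rows["dr_lift"] - rows["naive_lift"]

# Segments where adjustment reverses the direction of the estimate.
rows["sign_flip"] = (rows["naive_lift"] * rows["dr_lift"]) < 0

# Largest absolute changes first.
order = rows["adjustment_gap"].abs().sort_values(ascending=False).index
ranked = rows.loc[order]
print(ranked.round(4))
```

A sign flip on its own is not strong evidence: the lifestyleroyals interval in the table above still includes zero, so the gap should be read alongside the confidence interval.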

Interpretation Checklist

Use these questions to turn the notebook into a product narrative:

Which broad categories have the largest adjusted top-3 lift?
Do granular subcategories reveal stronger patterns than broad categories?
Does rank position matter more for low-history users or high-history users?
Does top-3 lift grow when candidate sets are larger?
Do low-exposure or high-exposure items benefit more from top placement?
Are any segment estimates too noisy to trust because confidence intervals are wide?
Where does doubly robust lift differ strongly from naive lift?

A strong portfolio conclusion should sound like this:

The global top-3 effect is positive, but the estimated lift is concentrated in specific segments. This suggests that ranking-position interventions should be evaluated not only on average CTR lift, but also on where visibility creates the most incremental engagement.

Avoid overclaiming. These are observational estimates from logged recommendation data, so the final writeup should mention the remaining risks: unobserved confounding, a limited feature set, weak overlap in some segments, and the need for online experimentation to validate policy changes.