09. Dataset Profiling with AI

Dataset profiling is the bridge between a business dataset and a credible causal design. In ordinary analytics, profiling often means checking column types, missing values, outliers, and duplicates. In causal inference, those checks are only the surface. We also need to ask:

What is the unit of analysis?
When is treatment assigned?
Which variables were measured before treatment?
Which variables might be outcomes, mediators, descendants of treatment, or leakage?
Does missingness depend on treatment, outcome, or operational workflow?
What domain questions must be answered before any estimator is trusted?

This notebook teaches a workflow where deterministic Python profiling does the reliable measurement work, and local LLMs help generate structured review notes, domain questions, and potential failure modes. The LLM is useful, but it is never treated as the source of truth.

Learning Goals

By the end of this notebook you should be able to:

Profile a dataset in a way that is specific to causal inference, not just general data quality.
Separate variable availability from variable eligibility for adjustment.
Detect common dataset risks: duplicate units, timing ambiguity, post-treatment controls, missingness patterns, and leakage.
Build a compact evidence bundle that an LLM can review without seeing the whole dataset.
Parse, audit, and score AI-generated dataset-profile notes.
Compare multiple local models on the same profiling task and treat instability as an empirical result.

Live Model Note

This course treats LLM behavior as an empirical object. These notebooks may include live local-model calls, so outputs can vary across model versions, hardware, decoding settings, prompt wording, package versions, and reruns. That instability is part of the lesson: AI-assisted causal work needs deterministic checks, structured outputs, model comparison, repair logic, and human review.

For this notebook in particular, dataset profiling can be brittle because the model has to infer roles from incomplete metadata. A model may correctly identify missingness but miss timing leakage, or it may notice post-treatment variables but overstate whether a causal effect is identifiable. We will let that brittleness remain visible and then audit it.

1. Setup

The deterministic parts of the notebook use pandas, numpy, scikit-learn, statsmodels, and visualization packages. The optional AI sections use the shared local Hugging Face utilities used throughout Course 05.

The notebook is safe to render as HTML because the live model calls are controlled by flags. If you render without executing live model cells, the notebook still documents the workflow. If you run the notebook locally with the downloaded models, it becomes a model-comparison lab.

import importlib.util
import json
import sys
import textwrap
import warnings
from pathlib import Path
from typing import Literal

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import torch
from IPython.display import Markdown, display
from pydantic import BaseModel, Field

warnings.filterwarnings('ignore', category=FutureWarning)

RUN_LIVE_LOCAL_LLM = True
RUN_FULL_MODEL_COMPARISON = True
RUN_SCHEMA_REPAIR_RETRY = True

LOCAL_SMOKE_TEST_MODEL = 'Qwen/Qwen2.5-0.5B-Instruct'
LOCAL_FAST_MODEL = 'Qwen/Qwen2.5-7B-Instruct'
LOCAL_STRONG_MODEL = 'Qwen/Qwen2.5-14B-Instruct'
LOCAL_SCALE_MODEL = 'Qwen/Qwen2.5-32B-Instruct'

LOCAL_ALT_REASONING_MODEL = 'microsoft/Phi-3.5-mini-instruct'
LOCAL_ALT_OPEN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.3'
LOCAL_MISTRAL_SMALL_MODEL = 'mistralai/Mistral-Small-3.1-24B-Instruct-2503'
LOCAL_GEMMA_MODEL = 'google/gemma-3-27b-it'
LOCAL_LLAMA_MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

MODEL_ID = LOCAL_FAST_MODEL
MAX_NEW_TOKENS = 1800
TEMPERATURE = 0.0
SEED = 209
MODEL_COMPARISON_CASE_LIMIT = 1

MODELS_TO_COMPARE = [
    ('Qwen 0.5B', LOCAL_SMOKE_TEST_MODEL, 'pipeline smoke test'),
    ('Qwen 7B', LOCAL_FAST_MODEL, 'fast default'),
    ('Qwen 14B', LOCAL_STRONG_MODEL, 'strong local analysis'),
    ('Qwen 32B', LOCAL_SCALE_MODEL, 'scale comparison'),
    ('Phi mini', LOCAL_ALT_REASONING_MODEL, 'compact non-Qwen comparison'),
    ('Mistral 7B', LOCAL_ALT_OPEN_MODEL, '7B model-family comparison'),
    ('Mistral Small 24B', LOCAL_MISTRAL_SMALL_MODEL, 'strong non-Qwen comparison'),
    ('Gemma 3 27B', LOCAL_GEMMA_MODEL, 'large non-Qwen comparison'),
    ('Llama 3.1 8B', LOCAL_LLAMA_MODEL, 'industry-standard instruct baseline'),
]

np.random.seed(SEED)
sns.set_theme(style='whitegrid', context='notebook')
pd.set_option('display.max_colwidth', 140)

def has_package(module_name):
    return importlib.util.find_spec(module_name) is not None

package_status = pd.DataFrame(
    [
        {'package': 'pandas', 'available': has_package('pandas'), 'used_for': 'deterministic profiling tables'},
        {'package': 'seaborn', 'available': has_package('seaborn'), 'used_for': 'profiling plots'},
        {'package': 'statsmodels', 'available': has_package('statsmodels'), 'used_for': 'simple regression smoke tests'},
        {'package': 'pydantic', 'available': has_package('pydantic'), 'used_for': 'structured AI profile schema'},
        {'package': 'transformers', 'available': has_package('transformers'), 'used_for': 'optional local LLM profiling'},
        {'package': 'torch', 'available': has_package('torch'), 'used_for': 'GPU inference if live LLMs are enabled'},
    ]
)

print(f'CUDA available to this kernel: {torch.cuda.is_available()}')
package_status

CUDA available to this kernel: True

	package	available	used_for
0	pandas	True	deterministic profiling tables
1	seaborn	True	profiling plots
2	statsmodels	True	simple regression smoke tests
3	pydantic	True	structured AI profile schema
4	transformers	True	optional local LLM profiling
5	torch	True	GPU inference if live LLMs are enabled

2. Why Dataset Profiling Is Causal Work

A causal analysis can fail before the estimator is chosen. Many failures come from the dataset itself:

The row is not the unit at which treatment is assigned.
The outcome window overlaps with the treatment window.
A variable that looks like a covariate is actually measured after treatment.
Duplicates create implicit weights for some units.
Missingness reveals operational workflow, not random noise.
A feature contains future information and quietly leaks the outcome.

A useful AI assistant can help produce a sharper checklist, but it must be grounded in deterministic evidence: schema summaries, timing metadata, missingness tables, duplicate checks, and simple balance diagnostics.

profiling_layers = pd.DataFrame(
    [
        {
            'layer': 'Basic data quality',
            'question': 'Do rows, columns, types, missingness, and duplicates look coherent?',
            'causal consequence': 'Prevents accidental weighting, impossible joins, and broken estimands.',
        },
        {
            'layer': 'Unit and time',
            'question': 'What is one observational unit, and when is treatment assigned?',
            'causal consequence': 'Defines the estimand, exposure window, outcome window, and clustering level.',
        },
        {
            'layer': 'Variable role',
            'question': 'Was each variable measured before treatment, during treatment, or after treatment?',
            'causal consequence': 'Separates eligible confounders from mediators, outcomes, and leakage.',
        },
        {
            'layer': 'Assignment audit',
            'question': 'Which pre-treatment variables predict treatment?',
            'causal consequence': 'Reveals confounding, positivity risks, and targeting rules.',
        },
        {
            'layer': 'Outcome and guardrails',
            'question': 'Are outcomes defined after treatment and consistently measured?',
            'causal consequence': 'Protects against metric drift and partial outcome capture.',
        },
        {
            'layer': 'AI-assisted review',
            'question': 'What risks and domain questions does a model infer from the evidence bundle?',
            'causal consequence': 'Expands the review, but only after deterministic profiling has created evidence.',
        },
    ]
)
profiling_layers

	layer	question	causal consequence
0	Basic data quality	Do rows, columns, types, missingness, and duplicates look coherent?	Prevents accidental weighting, impossible joins, and broken estimands.
1	Unit and time	What is one observational unit, and when is treatment assigned?	Defines the estimand, exposure window, outcome window, and clustering level.
2	Variable role	Was each variable measured before treatment, during treatment, or after treatment?	Separates eligible confounders from mediators, outcomes, and leakage.
3	Assignment audit	Which pre-treatment variables predict treatment?	Reveals confounding, positivity risks, and targeting rules.
4	Outcome and guardrails	Are outcomes defined after treatment and consistently measured?	Protects against metric drift and partial outcome capture.
5	AI-assisted review	What risks and domain questions does a model infer from the evidence bundle?	Expands the review, but only after deterministic profiling has created evidence.

Discussion

The important distinction is availability versus admissibility. A column may be available in the table but inadmissible in an adjustment set. For example, agent_uses_ai may be highly predictive of workload after an assistant is enabled, but it is downstream of enablement. Adjusting for it would change the estimand from the total effect of enablement to something closer to a controlled direct effect.

This is why a causal profile needs timing metadata. Without timing, an AI assistant may treat every predictive column as a useful control. That is exactly the behavior we want to catch.

3. Running Example: AI Support Assistant Rollout

We will use a synthetic dataset for a company that rolled out an AI assistant to support teams. The business wants to know whether enablement reduced human-handled workload without harming customer experience.

The row grain is intended to be team_id by week_start. The treatment is assistant_enabled. The primary outcome is human_handled_hours. The guardrail outcome is customer_satisfaction_score.

The dataset intentionally contains realistic profiling issues:

Some duplicated team-weeks.
Missing baseline volume that depends on region and readiness.
Missing satisfaction scores that are more common in high-volume queues.
Post-treatment variables that are tempting but should not be adjusted for when estimating a total effect.
A future variable that leaks information from the next week.

def simulate_support_assistant_data(n_teams=90, n_weeks=14, seed=SEED):
    rng = np.random.default_rng(seed)
    teams = pd.DataFrame(
        {
            'team_id': [f'T{idx:03d}' for idx in range(n_teams)],
            'region': rng.choice(['Americas', 'EMEA', 'APAC'], size=n_teams, p=[0.45, 0.35, 0.20]),
            'queue_type': rng.choice(['billing', 'technical', 'retention'], size=n_teams, p=[0.35, 0.45, 0.20]),
            'team_readiness': rng.normal(0, 1, size=n_teams),
            'manager_priority': rng.binomial(1, 0.35, size=n_teams),
        }
    )
    teams['baseline_ticket_volume'] = (
        240
        + 45 * (teams['queue_type'] == 'technical').astype(int)
        + 35 * (teams['queue_type'] == 'retention').astype(int)
        + 18 * teams['manager_priority']
        + 22 * teams['team_readiness']
        + rng.normal(0, 30, size=n_teams)
    ).clip(80, None)
    teams['baseline_satisfaction'] = (
        82
        - 0.018 * teams['baseline_ticket_volume']
        + 2.2 * teams['team_readiness']
        + rng.normal(0, 2.5, size=n_teams)
    ).clip(55, 95)

    weeks = pd.date_range('2025-01-06', periods=n_weeks, freq='W-MON')
    rows = []
    for _, team in teams.iterrows():
        rollout_score = (
            -0.7
            + 0.95 * team['manager_priority']
            + 0.65 * team['team_readiness']
            + 0.0035 * team['baseline_ticket_volume']
            + (0.25 if team['region'] == 'Americas' else 0.0)
        )
        rollout_probability = 1 / (1 + np.exp(-rollout_score))
        eligible_week = rng.integers(4, 10) if rng.uniform() < rollout_probability else 99
        for week_idx, week_start in enumerate(weeks):
            enabled = int(week_idx >= eligible_week)
            seasonal_pressure = 18 * np.sin((week_idx + 1) / n_weeks * 2 * np.pi)
            backlog_start = (
                55
                + 0.18 * team['baseline_ticket_volume']
                - 7.5 * team['team_readiness']
                + 9 * (team['queue_type'] == 'retention')
                + seasonal_pressure
                + rng.normal(0, 12)
            )
            staffing_hours = (
                420
                + 0.32 * team['baseline_ticket_volume']
                - 0.45 * backlog_start
                + 18 * team['manager_priority']
                + rng.normal(0, 18)
            )
            agent_uses_ai = np.nan
            deflection_rate = np.nan
            if enabled:
                agent_uses_ai = float(np.clip(
                    0.35
                    + 0.16 * team['team_readiness']
                    + 0.06 * team['manager_priority']
                    + rng.normal(0, 0.07),
                    0.05,
                    0.92,
                ))
                deflection_rate = float(np.clip(
                    0.12
                    + 0.38 * agent_uses_ai
                    + rng.normal(0, 0.04),
                    0.02,
                    0.65,
                ))

            true_total_effect = -22 - 12 * max(team['team_readiness'], 0)
            human_handled_hours = (
                120
                + 0.42 * team['baseline_ticket_volume']
                + 0.65 * backlog_start
                - 0.18 * staffing_hours
                - 7 * team['team_readiness']
                + true_total_effect * enabled
                + rng.normal(0, 18)
            )
            customer_satisfaction_score = float(np.clip(
                team['baseline_satisfaction']
                - 0.018 * backlog_start
                + 1.4 * enabled
                - 2.8 * max(enabled * (deflection_rate if not np.isnan(deflection_rate) else 0) - 0.35, 0)
                + rng.normal(0, 2.4),
                35,
                98,
            ))
            next_week_backlog = backlog_start + 0.09 * human_handled_hours - 7 * enabled + rng.normal(0, 10)

            rows.append(
                {
                    'team_id': team['team_id'],
                    'week_start': week_start,
                    'region': team['region'],
                    'queue_type': team['queue_type'],
                    'assistant_enabled': enabled,
                    'baseline_ticket_volume': team['baseline_ticket_volume'],
                    'baseline_satisfaction': team['baseline_satisfaction'],
                    'team_readiness': team['team_readiness'],
                    'manager_priority': team['manager_priority'],
                    'backlog_start': backlog_start,
                    'staffing_hours': staffing_hours,
                    'agent_uses_ai': agent_uses_ai,
                    'deflection_rate': deflection_rate,
                    'human_handled_hours': human_handled_hours,
                    'customer_satisfaction_score': customer_satisfaction_score,
                    'next_week_backlog': next_week_backlog,
                }
            )
    df = pd.DataFrame(rows)

    readiness_missing = (df['region'].eq('APAC') & (df['team_readiness'] < -0.25) & (rng.uniform(size=len(df)) < 0.28))
    df.loc[readiness_missing, 'baseline_ticket_volume'] = np.nan

    csat_missing = ((df['queue_type'].eq('technical')) & (df['backlog_start'] > df['backlog_start'].quantile(0.70)) & (rng.uniform(size=len(df)) < 0.32))
    df.loc[csat_missing, 'customer_satisfaction_score'] = np.nan

    duplicate_indices = rng.choice(df.index, size=8, replace=False)
    duplicates = df.loc[duplicate_indices].copy()
    duplicates['staffing_hours'] = duplicates['staffing_hours'] + rng.normal(0, 5, size=len(duplicates))
    df = pd.concat([df, duplicates], ignore_index=True).sample(frac=1, random_state=seed).reset_index(drop=True)
    return df

raw = simulate_support_assistant_data()
raw.head()

	team_id	week_start	region	queue_type	assistant_enabled	baseline_ticket_volume	baseline_satisfaction	team_readiness	manager_priority	backlog_start	staffing_hours	agent_uses_ai	deflection_rate	human_handled_hours	customer_satisfaction_score	next_week_backlog
0	T049	2025-03-03	Americas	technical	1	318.680028	80.363466	1.144821	1	84.914952	499.181159	0.570938	0.400712	161.649237	79.808041	79.694226
1	T069	2025-02-03	Americas	retention	0	330.834274	76.563602	1.978773	0	142.913594	481.671278	NaN	NaN	237.072721	75.507517	175.875164
2	T052	2025-03-31	Americas	technical	1	255.578769	69.362715	-0.029563	1	123.375787	477.236520	0.462384	0.234673	185.545918	66.100149	123.649601
3	T048	2025-01-06	Americas	billing	0	304.385110	72.127583	-0.500037	0	142.275779	426.880542	NaN	NaN	272.949791	68.228324	165.666386
4	T022	2025-02-24	Americas	technical	1	317.634111	76.879693	0.494718	0	102.309229	438.123429	0.367080	0.180488	233.986433	79.314878	108.473944

data_dictionary = pd.DataFrame(
    [
        {'variable': 'team_id', 'description': 'Support team identifier.', 'timing': 'unit id', 'initial_role': 'unit identifier'},
        {'variable': 'week_start', 'description': 'Start date for the reporting week.', 'timing': 'time id', 'initial_role': 'time identifier'},
        {'variable': 'region', 'description': 'Operating region for the team.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'queue_type', 'description': 'Primary queue served by the team.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'assistant_enabled', 'description': 'Whether the AI assistant was enabled for the team during the week.', 'timing': 'treatment week', 'initial_role': 'treatment'},
        {'variable': 'baseline_ticket_volume', 'description': 'Average weekly ticket volume before rollout planning.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'baseline_satisfaction', 'description': 'Pre-rollout customer satisfaction score.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'team_readiness', 'description': 'Internal readiness score used by enablement managers.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'manager_priority', 'description': 'Indicator that leadership prioritized the team for early enablement.', 'timing': 'pre-treatment', 'initial_role': 'candidate confounder'},
        {'variable': 'backlog_start', 'description': 'Open ticket backlog at the beginning of the week.', 'timing': 'pre-treatment within week', 'initial_role': 'candidate confounder'},
        {'variable': 'staffing_hours', 'description': 'Planned human staffing capacity for the week.', 'timing': 'pre-treatment within week', 'initial_role': 'candidate confounder'},
        {'variable': 'agent_uses_ai', 'description': 'Share of agents actively using the assistant after enablement.', 'timing': 'post-treatment', 'initial_role': 'mediator or compliance measure'},
        {'variable': 'deflection_rate', 'description': 'Share of contacts deflected after assistant interactions.', 'timing': 'post-treatment', 'initial_role': 'mediator'},
        {'variable': 'human_handled_hours', 'description': 'Human workload hours during the week.', 'timing': 'outcome window', 'initial_role': 'primary outcome'},
        {'variable': 'customer_satisfaction_score', 'description': 'Customer satisfaction score for tickets resolved during the week.', 'timing': 'outcome window', 'initial_role': 'guardrail outcome'},
        {'variable': 'next_week_backlog', 'description': 'Open backlog measured at the beginning of the following week.', 'timing': 'future', 'initial_role': 'future leakage risk'},
    ]
)

data_dictionary

	variable	description	timing	initial_role
0	team_id	Support team identifier.	unit id	unit identifier
1	week_start	Start date for the reporting week.	time id	time identifier
2	region	Operating region for the team.	pre-treatment	candidate confounder
3	queue_type	Primary queue served by the team.	pre-treatment	candidate confounder
4	assistant_enabled	Whether the AI assistant was enabled for the team during the week.	treatment week	treatment
5	baseline_ticket_volume	Average weekly ticket volume before rollout planning.	pre-treatment	candidate confounder
6	baseline_satisfaction	Pre-rollout customer satisfaction score.	pre-treatment	candidate confounder
7	team_readiness	Internal readiness score used by enablement managers.	pre-treatment	candidate confounder
8	manager_priority	Indicator that leadership prioritized the team for early enablement.	pre-treatment	candidate confounder
9	backlog_start	Open ticket backlog at the beginning of the week.	pre-treatment within week	candidate confounder
10	staffing_hours	Planned human staffing capacity for the week.	pre-treatment within week	candidate confounder
11	agent_uses_ai	Share of agents actively using the assistant after enablement.	post-treatment	mediator or compliance measure
12	deflection_rate	Share of contacts deflected after assistant interactions.	post-treatment	mediator
13	human_handled_hours	Human workload hours during the week.	outcome window	primary outcome
14	customer_satisfaction_score	Customer satisfaction score for tickets resolved during the week.	outcome window	guardrail outcome
15	next_week_backlog	Open backlog measured at the beginning of the following week.	future	future leakage risk

Discussion

The data dictionary is doing causal work. The column name deflection_rate sounds like a useful operational control. The timing metadata says it is post-treatment. That changes how we treat it.

For the total effect of enablement, post-treatment usage and deflection are part of the pathway by which enablement may affect workload. Adjusting for them would remove part of the effect we are trying to estimate. They may be useful in mechanism analysis, but not in the main total-effect adjustment set.

4. Basic Grain and Schema Profile

The first deterministic task is to verify the intended row grain. Here the intended unit is one row per team_id and week_start. If that grain is violated, every downstream estimate silently changes because duplicated rows give extra weight to some team-weeks.

def basic_schema_profile(df, unit_cols):
    duplicate_count = df.duplicated(unit_cols, keep=False).sum()
    expected_unique_units = df[unit_cols].drop_duplicates().shape[0]
    return pd.DataFrame(
        [
            {'metric': 'rows', 'value': len(df)},
            {'metric': 'columns', 'value': df.shape[1]},
            {'metric': 'unique team-weeks', 'value': expected_unique_units},
            {'metric': 'duplicated rows by intended grain', 'value': int(duplicate_count)},
            {'metric': 'first week', 'value': df['week_start'].min()},
            {'metric': 'last week', 'value': df['week_start'].max()},
            {'metric': 'treated rows', 'value': int(df['assistant_enabled'].sum())},
            {'metric': 'untreated rows', 'value': int((1 - df['assistant_enabled']).sum())},
        ]
    )

schema_profile = basic_schema_profile(raw, ['team_id', 'week_start'])
schema_profile

	metric	value
0	rows	1268
1	columns	16
2	unique team-weeks	1260
3	duplicated rows by intended grain	16
4	first week	2025-01-06 00:00:00
5	last week	2025-04-07 00:00:00
6	treated rows	463
7	untreated rows	805

type_profile = pd.DataFrame(
    {
        'variable': raw.columns,
        'dtype': [str(raw[col].dtype) for col in raw.columns],
        'non_null': [raw[col].notna().sum() for col in raw.columns],
        'missing_share': [raw[col].isna().mean() for col in raw.columns],
        'unique_values': [raw[col].nunique(dropna=True) for col in raw.columns],
    }
).merge(data_dictionary[['variable', 'timing', 'initial_role']], on='variable', how='left')

type_profile.sort_values(['missing_share', 'unique_values'], ascending=[False, True])

	variable	dtype	non_null	missing_share	unique_values	timing	initial_role
11	agent_uses_ai	float64	463	0.634858	451	post-treatment	mediator or compliance measure
12	deflection_rate	float64	463	0.634858	458	post-treatment	mediator
14	customer_satisfaction_score	float64	1216	0.041009	1208	outcome window	guardrail outcome
5	baseline_ticket_volume	float64	1244	0.018927	90	pre-treatment	candidate confounder
4	assistant_enabled	int64	1268	0.000000	2	treatment week	treatment
8	manager_priority	int64	1268	0.000000	2	pre-treatment	candidate confounder
2	region	str	1268	0.000000	3	pre-treatment	candidate confounder
3	queue_type	str	1268	0.000000	3	pre-treatment	candidate confounder
1	week_start	datetime64[us]	1268	0.000000	14	time id	time identifier
0	team_id	str	1268	0.000000	90	unit id	unit identifier
6	baseline_satisfaction	float64	1268	0.000000	90	pre-treatment	candidate confounder
7	team_readiness	float64	1268	0.000000	90	pre-treatment	candidate confounder
9	backlog_start	float64	1268	0.000000	1260	pre-treatment within week	candidate confounder
13	human_handled_hours	float64	1268	0.000000	1260	outcome window	primary outcome
15	next_week_backlog	float64	1268	0.000000	1260	future	future leakage risk
10	staffing_hours	float64	1268	0.000000	1268	pre-treatment within week	candidate confounder

Discussion

General profiling would stop at missingness and types. Causal profiling attaches those symptoms to timing and role.

For example, missing values in agent_uses_ai and deflection_rate are expected for untreated rows because those variables are only meaningful after enablement. That is different from missing pre-treatment ticket volume or missing guardrail outcomes. The former is structural; the latter can threaten identification or outcome validity.

5. Missingness as a Causal Signal

Missingness is not just a data-cleaning nuisance. It can reveal operational workflow. If satisfaction is missing more often in high-backlog treated weeks, complete-case analysis may select a non-comparable subset. If a pre-treatment covariate is missing more often in one rollout region, adjustment may rely on a biased analytic sample.

missing_table = (
    raw.isna()
    .mean()
    .rename('missing_share')
    .reset_index()
    .rename(columns={'index': 'variable'})
    .merge(data_dictionary[['variable', 'timing', 'initial_role']], on='variable', how='left')
    .sort_values('missing_share', ascending=False)
)

missing_by_treatment = (
    raw.assign(treatment_group=np.where(raw['assistant_enabled'].eq(1), 'enabled', 'not enabled'))
    .groupby('treatment_group')
    .apply(lambda frame: frame.isna().mean(numeric_only=False), include_groups=False)
    .T.reset_index()
    .rename(columns={'index': 'variable'})
)

missing_table.head(10)

	variable	missing_share	timing	initial_role
12	deflection_rate	0.634858	post-treatment	mediator
11	agent_uses_ai	0.634858	post-treatment	mediator or compliance measure
14	customer_satisfaction_score	0.041009	outcome window	guardrail outcome
5	baseline_ticket_volume	0.018927	pre-treatment	candidate confounder
3	queue_type	0.000000	pre-treatment	candidate confounder
0	team_id	0.000000	unit id	unit identifier
2	region	0.000000	pre-treatment	candidate confounder
1	week_start	0.000000	time id	time identifier
7	team_readiness	0.000000	pre-treatment	candidate confounder
6	baseline_satisfaction	0.000000	pre-treatment	candidate confounder

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
plot_missing = missing_table.query('missing_share > 0').copy()
sns.barplot(data=plot_missing, y='variable', x='missing_share', hue='timing', dodge=False, ax=axes[0])
axes[0].set_title('Missingness by variable')
axes[0].set_xlabel('Missing share')
axes[0].set_ylabel('')
axes[0].legend(loc='lower right')

missing_heatmap_data = raw.sort_values(['team_id', 'week_start']).isna().astype(int)
sns.heatmap(missing_heatmap_data.T, cbar=False, ax=axes[1])
axes[1].set_title('Missingness pattern across rows')
axes[1].set_xlabel('Rows sorted by team-week')
axes[1].set_ylabel('Variables')
plt.tight_layout()
plt.show()

missing_contrast = missing_by_treatment.merge(data_dictionary[['variable', 'timing', 'initial_role']], on='variable', how='left')
if {'enabled', 'not enabled'}.issubset(missing_contrast.columns):
    missing_contrast['enabled_minus_not_enabled'] = missing_contrast['enabled'] - missing_contrast['not enabled']
else:
    missing_contrast['enabled_minus_not_enabled'] = np.nan
missing_contrast.sort_values('enabled_minus_not_enabled', key=lambda s: s.abs(), ascending=False).head(10)

	variable	enabled	not enabled	timing	initial_role	enabled_minus_not_enabled
12	deflection_rate	0.000000	1.000000	post-treatment	mediator	-1.000000
11	agent_uses_ai	0.000000	1.000000	post-treatment	mediator or compliance measure	-1.000000
14	customer_satisfaction_score	0.008639	0.059627	outcome window	guardrail outcome	-0.050988
5	baseline_ticket_volume	0.010799	0.023602	pre-treatment	candidate confounder	-0.012803
3	queue_type	0.000000	0.000000	pre-treatment	candidate confounder	0.000000
0	team_id	0.000000	0.000000	unit id	unit identifier	0.000000
2	region	0.000000	0.000000	pre-treatment	candidate confounder	0.000000
1	week_start	0.000000	0.000000	time id	time identifier	0.000000
7	team_readiness	0.000000	0.000000	pre-treatment	candidate confounder	0.000000
6	baseline_satisfaction	0.000000	0.000000	pre-treatment	candidate confounder	0.000000

Discussion

This table should make us uncomfortable in two different ways.

First, post-treatment variables are missing mainly in untreated rows. That is expected and should not be imputed as if the values were accidentally lost. Second, some variables that matter for analysis, such as baseline volume or satisfaction, can be missing because of workflow. That kind of missingness should become a design question: Who is absent from the analytic sample, and is that absence related to treatment assignment or outcomes?

6. Duplicate and Unit-Integrity Checks

Duplicates are especially dangerous in panel causal analysis. A duplicated treated team-week can make the treated group look larger. A duplicated high-backlog week can also overweight a particular operational condition.

The goal is not only to count duplicates. We want to inspect whether duplicate rows disagree on fields that should be fixed for the same unit-time.

unit_cols = ['team_id', 'week_start']
duplicate_rows = raw.loc[raw.duplicated(unit_cols, keep=False)].sort_values(unit_cols)

duplicate_summary = (
    duplicate_rows
    .groupby(unit_cols)
    .agg(
        duplicate_rows=('team_id', 'size'),
        staffing_min=('staffing_hours', 'min'),
        staffing_max=('staffing_hours', 'max'),
        workload_min=('human_handled_hours', 'min'),
        workload_max=('human_handled_hours', 'max'),
    )
    .reset_index()
)
duplicate_summary['staffing_range'] = duplicate_summary['staffing_max'] - duplicate_summary['staffing_min']
duplicate_summary['workload_range'] = duplicate_summary['workload_max'] - duplicate_summary['workload_min']
duplicate_summary.head(10)

	team_id	week_start	duplicate_rows	staffing_min	staffing_max	workload_min	workload_max	staffing_range
0	T011	2025-03-03	2	493.282383	496.713121	187.039698	187.039698	3.430739
1	T015	2025-02-24	2	508.097029	508.453978	184.279689	184.279689	0.356949
2	T034	2025-01-06	2	470.258640	481.926088	202.833084	202.833084	11.667448
3	T064	2025-02-03	2	452.553596	461.421005	187.053952	187.053952	8.867410
4	T064	2025-03-03	2	502.056521	507.144791	149.135410	149.135410	5.088270
5	T071	2025-02-17	2	466.979210	468.267164	153.671606	153.671606	1.287954
6	T079	2025-01-13	2	463.032303	475.142666	266.199179	266.199179	12.110362
7	T085	2025-03-03	2	462.597699	467.323954	231.153278	231.153278	4.726255

Discussion

When duplicates disagree, we need to determine whether they are true duplicate extracts, multiple records that need aggregation, or conflicting source-system records. The causal question cannot answer this for us. A data owner must resolve the unit definition.

7. Treatment Assignment and Pre-Treatment Balance

Before estimating anything, we ask whether treatment assignment is associated with pre-treatment variables. In an experiment, randomized assignment should make large systematic differences less likely. In an observational rollout, assignment often follows readiness, queue pressure, manager priority, or customer risk.

A simple standardized mean difference table is not a full identification argument, but it is a fast profiling tool.

PRE_TREATMENT_NUMERIC = [
    'baseline_ticket_volume',
    'baseline_satisfaction',
    'team_readiness',
    'manager_priority',
    'backlog_start',
    'staffing_hours',
]


def standardized_mean_difference(df, treatment_col, variables):
    rows = []
    treated = df[df[treatment_col].eq(1)]
    control = df[df[treatment_col].eq(0)]
    for variable in variables:
        t = treated[variable].dropna()
        c = control[variable].dropna()
        pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
        smd = np.nan if pooled_sd == 0 or np.isnan(pooled_sd) else (t.mean() - c.mean()) / pooled_sd
        rows.append(
            {
                'variable': variable,
                'treated_mean': t.mean(),
                'control_mean': c.mean(),
                'smd': smd,
                'abs_smd': abs(smd) if not np.isnan(smd) else np.nan,
            }
        )
    return pd.DataFrame(rows).sort_values('abs_smd', ascending=False)

balance_table = standardized_mean_difference(raw, 'assistant_enabled', PRE_TREATMENT_NUMERIC)
balance_table

	variable	treated_mean	control_mean	smd	abs_smd
4	backlog_start	98.232173	114.079097	-0.898040	0.898040
5	staffing_hours	479.813009	467.306275	0.517624	0.517624
2	team_readiness	0.253501	-0.033788	0.284007	0.284007
0	baseline_ticket_volume	289.011570	279.646479	0.250091	0.250091
3	manager_priority	0.598272	0.479503	0.239765	0.239765
1	baseline_satisfaction	77.066097	76.330197	0.234774	0.234774

fig, ax = plt.subplots(figsize=(8, 4.5))
sns.barplot(data=balance_table, y='variable', x='smd', color='#4477AA', ax=ax)
ax.axvline(0, color='black', linewidth=1)
ax.axvline(0.10, color='gray', linestyle='--', linewidth=1)
ax.axvline(-0.10, color='gray', linestyle='--', linewidth=1)
ax.set_title('Pre-treatment imbalance by assistant enablement')
ax.set_xlabel('Standardized mean difference')
ax.set_ylabel('')
plt.tight_layout()
plt.show()

Discussion

The treated rows differ from untreated rows on variables measured before treatment. That does not prove confounding by itself, but it strongly suggests that a raw treated-versus-untreated contrast is not a credible causal estimate.

The AI assistant should be able to notice this pattern if we give it the balance table. It should not infer that adjustment solves everything. It should ask whether these variables are sufficient, whether treatment timing is correctly represented, and whether team-level clustering or staggered rollout matters.

8. Timing, Mediators, and Leakage Scan

A profiler for causal inference must explicitly name variables that should not be adjusted for in the primary total-effect model.

Here are three different reasons a variable can be risky:

Mediator: affected by treatment and part of the mechanism.
Outcome or guardrail: measured after treatment and should not be used as a control.
Future leakage: measured after the outcome window or using future information.

These roles are design-dependent. The same variable can be valid in one estimand and invalid in another.

role_scan = data_dictionary.copy()
role_scan['adjustment_eligibility_for_total_effect'] = np.select(
    [
        role_scan['timing'].isin(['pre-treatment', 'pre-treatment within week']),
        role_scan['initial_role'].eq('treatment'),
        role_scan['timing'].isin(['post-treatment', 'outcome window', 'future']),
    ],
    [
        'eligible candidate, subject to causal graph review',
        'not a control: treatment',
        'exclude from primary adjustment set',
    ],
    default='not an adjustment variable',
)
role_scan[['variable', 'timing', 'initial_role', 'adjustment_eligibility_for_total_effect']]

	variable	timing	initial_role	adjustment_eligibility_for_total_effect
0	team_id	unit id	unit identifier	not an adjustment variable
1	week_start	time id	time identifier	not an adjustment variable
2	region	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
3	queue_type	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
4	assistant_enabled	treatment week	treatment	not a control: treatment
5	baseline_ticket_volume	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
6	baseline_satisfaction	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
7	team_readiness	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
8	manager_priority	pre-treatment	candidate confounder	eligible candidate, subject to causal graph review
9	backlog_start	pre-treatment within week	candidate confounder	eligible candidate, subject to causal graph review
10	staffing_hours	pre-treatment within week	candidate confounder	eligible candidate, subject to causal graph review
11	agent_uses_ai	post-treatment	mediator or compliance measure	exclude from primary adjustment set
12	deflection_rate	post-treatment	mediator	exclude from primary adjustment set
13	human_handled_hours	outcome window	primary outcome	exclude from primary adjustment set
14	customer_satisfaction_score	outcome window	guardrail outcome	exclude from primary adjustment set
15	next_week_backlog	future	future leakage risk	exclude from primary adjustment set

numeric_cols = raw.select_dtypes(include='number').columns.tolist()
leakage_correlations = (
    raw[numeric_cols]
    .corr(numeric_only=True)['human_handled_hours']
    .drop('human_handled_hours')
    .abs()
    .sort_values(ascending=False)
    .rename('abs_corr_with_workload')
    .reset_index()
    .rename(columns={'index': 'variable'})
    .merge(data_dictionary[['variable', 'timing', 'initial_role']], on='variable', how='left')
)
leakage_correlations.head(10)

	variable	abs_corr_with_workload	timing	initial_role
0	next_week_backlog	0.732711	future	future leakage risk
1	backlog_start	0.697131	pre-treatment within week	candidate confounder
2	assistant_enabled	0.557883	treatment week	treatment
3	agent_uses_ai	0.331058	post-treatment	mediator or compliance measure
4	baseline_ticket_volume	0.275312	pre-treatment	candidate confounder
5	deflection_rate	0.272418	post-treatment	mediator
6	customer_satisfaction_score	0.257702	outcome window	guardrail outcome
7	team_readiness	0.201766	pre-treatment	candidate confounder
8	staffing_hours	0.191644	pre-treatment within week	candidate confounder
9	baseline_satisfaction	0.127335	pre-treatment	candidate confounder

Discussion

High predictive value is not the same as valid adjustment. next_week_backlog can be predictive because it contains future information. deflection_rate can be predictive because it lies on a treatment pathway. These variables are exactly the kind of columns that automated ML feature selection would like and a causal analyst should question.

This is where AI assistance can help as a reviewer: ask the model to classify variable roles and explain its uncertainty. But the deterministic timing table remains the anchor.

9. Regression Smoke Test: How Profiling Changes the Story

This section is not meant to estimate the final causal effect. It is a smoke test that shows how sensitive the treatment coefficient is to variable choice.

We compare three models:

A naive treated-versus-untreated regression.
A pre-treatment adjustment model.
A bad adjustment model that includes post-treatment and future variables.

The third model is intentionally wrong for the total effect. It is included because it reveals a common failure mode: the model may look more predictive while answering a different causal question.

analysis_df = raw.dropna(subset=['human_handled_hours', 'baseline_ticket_volume', 'baseline_satisfaction']).copy()

models = {
    'naive': 'human_handled_hours ~ assistant_enabled',
    'pre-treatment adjustment': (
        'human_handled_hours ~ assistant_enabled + baseline_ticket_volume + baseline_satisfaction + '
        'team_readiness + manager_priority + backlog_start + staffing_hours + C(region) + C(queue_type)'
    ),
    'bad post-treatment/leaky adjustment': (
        'human_handled_hours ~ assistant_enabled + baseline_ticket_volume + baseline_satisfaction + '
        'team_readiness + manager_priority + backlog_start + staffing_hours + C(region) + C(queue_type) + '
        'agent_uses_ai + deflection_rate + next_week_backlog'
    ),
}

coef_rows = []
for label, formula in models.items():
    model_df = analysis_df.copy()
    if 'agent_uses_ai' in formula:
        model_df = model_df.dropna(subset=['agent_uses_ai', 'deflection_rate', 'next_week_backlog'])
    fitted = smf.ols(formula, data=model_df).fit(cov_type='HC3')
    coef_rows.append(
        {
            'model': label,
            'rows_used': int(fitted.nobs),
            'assistant_enabled_coef': fitted.params.get('assistant_enabled', np.nan),
            'robust_se': fitted.bse.get('assistant_enabled', np.nan),
            'p_value': fitted.pvalues.get('assistant_enabled', np.nan),
            'r_squared': fitted.rsquared,
        }
    )

coef_table = pd.DataFrame(coef_rows)
coef_table

	model	rows_used	assistant_enabled_coef	robust_se	p_value	r_squared
0	naive	1244	-39.516817	1.680661	3.026786e-122	0.314412
1	pre-treatment adjustment	1244	-28.385061	1.257982	9.792889e-113	0.687568
2	bad post-treatment/leaky adjustment	458	13.191847	16.934753	4.359911e-01	0.576314

fig, ax = plt.subplots(figsize=(9, 3.8))
plot_coef = coef_table.assign(
    lower=lambda d: d['assistant_enabled_coef'] - 1.96 * d['robust_se'],
    upper=lambda d: d['assistant_enabled_coef'] + 1.96 * d['robust_se'],
)
ax.errorbar(
    plot_coef['assistant_enabled_coef'],
    plot_coef['model'],
    xerr=[plot_coef['assistant_enabled_coef'] - plot_coef['lower'], plot_coef['upper'] - plot_coef['assistant_enabled_coef']],
    fmt='o',
    color='#228833',
    ecolor='#777777',
    capsize=4,
)
ax.axvline(0, color='black', linewidth=1)
ax.set_title('Treatment coefficient changes with profiling decisions')
ax.set_xlabel('Coefficient on assistant_enabled')
ax.set_ylabel('')
plt.tight_layout()
plt.show()

Discussion

The bad adjustment model may use fewer rows because post-treatment variables only exist after enablement. That sample restriction alone changes the estimand. It also adjusts for variables affected by treatment, which blocks part of the treatment pathway.

The correct lesson is not that the pre-treatment model is automatically valid. The lesson is that profiling reveals which model specifications are not even candidates for the main total-effect analysis.

10. Build an Evidence Bundle for AI Review

An LLM should not receive the full raw dataset as a substitute for profiling. Instead, give it a compact evidence bundle created by deterministic code.

A good bundle includes:

Business question and intended estimand.
Row grain and unit checks.
Data dictionary with timing metadata.
Missingness summaries.
Balance diagnostics.
Duplicate examples.
Timing and leakage candidates.
A warning that the LLM should not claim identification from profiling alone.

def dataframe_records(df, max_rows=20):
    return json.loads(df.head(max_rows).to_json(orient='records', date_format='iso'))

profile_bundle = {
    'business_question': 'Did enabling the AI support assistant reduce human-handled workload without harming customer experience?',
    'intended_estimand': 'Total effect of assistant enablement on weekly human-handled workload at the team-week level.',
    'intended_row_grain': ['team_id', 'week_start'],
    'treatment': 'assistant_enabled',
    'primary_outcome': 'human_handled_hours',
    'guardrail_outcome': 'customer_satisfaction_score',
    'schema_profile': dataframe_records(schema_profile, max_rows=20),
    'data_dictionary': dataframe_records(data_dictionary, max_rows=30),
    'type_profile_top_missing': dataframe_records(type_profile.sort_values('missing_share', ascending=False), max_rows=20),
    'missing_by_treatment': dataframe_records(missing_contrast.sort_values('enabled_minus_not_enabled', key=lambda s: s.abs(), ascending=False), max_rows=16),
    'duplicate_summary': dataframe_records(duplicate_summary, max_rows=8),
    'balance_table': dataframe_records(balance_table, max_rows=12),
    'role_scan': dataframe_records(role_scan, max_rows=25),
    'leakage_correlations': dataframe_records(leakage_correlations, max_rows=12),
    'regression_smoke_test': dataframe_records(coef_table, max_rows=10),
    'explicit_warning': 'Dataset profiling can identify risks and questions, but it does not by itself establish causal identification.',
}

print(json.dumps(profile_bundle, indent=2)[:4000])

{
  "business_question": "Did enabling the AI support assistant reduce human-handled workload without harming customer experience?",
  "intended_estimand": "Total effect of assistant enablement on weekly human-handled workload at the team-week level.",
  "intended_row_grain": [
    "team_id",
    "week_start"
  ],
  "treatment": "assistant_enabled",
  "primary_outcome": "human_handled_hours",
  "guardrail_outcome": "customer_satisfaction_score",
  "schema_profile": [
    {
      "metric": "rows",
      "value": 1268
    },
    {
      "metric": "columns",
      "value": 16
    },
    {
      "metric": "unique team-weeks",
      "value": 1260
    },
    {
      "metric": "duplicated rows by intended grain",
      "value": 16
    },
    {
      "metric": "first week",
      "value": "2025-01-06T00:00:00.000"
    },
    {
      "metric": "last week",
      "value": "2025-04-07T00:00:00.000"
    },
    {
      "metric": "treated rows",
      "value": 463
    },
    {
      "metric": "untreated rows",
      "value": 805
    }
  ],
  "data_dictionary": [
    {
      "variable": "team_id",
      "description": "Support team identifier.",
      "timing": "unit id",
      "initial_role": "unit identifier"
    },
    {
      "variable": "week_start",
      "description": "Start date for the reporting week.",
      "timing": "time id",
      "initial_role": "time identifier"
    },
    {
      "variable": "region",
      "description": "Operating region for the team.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "queue_type",
      "description": "Primary queue served by the team.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "assistant_enabled",
      "description": "Whether the AI assistant was enabled for the team during the week.",
      "timing": "treatment week",
      "initial_role": "treatment"
    },
    {
      "variable": "baseline_ticket_volume",
      "description": "Average weekly ticket volume before rollout planning.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "baseline_satisfaction",
      "description": "Pre-rollout customer satisfaction score.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "team_readiness",
      "description": "Internal readiness score used by enablement managers.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "manager_priority",
      "description": "Indicator that leadership prioritized the team for early enablement.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "backlog_start",
      "description": "Open ticket backlog at the beginning of the week.",
      "timing": "pre-treatment within week",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "staffing_hours",
      "description": "Planned human staffing capacity for the week.",
      "timing": "pre-treatment within week",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "agent_uses_ai",
      "description": "Share of agents actively using the assistant after enablement.",
      "timing": "post-treatment",
      "initial_role": "mediator or compliance measure"
    },
    {
      "variable": "deflection_rate",
      "description": "Share of contacts deflected after assistant interactions.",
      "timing": "post-treatment",
      "initial_role": "mediator"
    },
    {
      "variable": "human_handled_hours",
      "description": "Human workload hours during the week.",
      "timing": "outcome window",
      "initial_role": "primary outcome"
    },
    {
      "variable": "customer_satisfaction_score",
      "description": "Customer satisfaction score for tickets resolved during the week.",
      "timing": "outco

Discussion

This bundle is intentionally lossy. It does not try to serialize the entire dataset. It preserves the evidence needed for a profiling conversation.

The advantage of this pattern is auditability. If the LLM says next_week_backlog is a leakage risk, we can point to the timing table. If it says there are duplicate team-weeks, we can point to the duplicate summary. If it invents a randomized rollout, we can catch it because the bundle never says that.

11. Shared Local LLM Runtime

The local model utilities are centralized in notebooks/_shared/local_llm.py. This matters because the model families in this course do not all load the same way. Some use standard causal language-model classes, while Gemma, Phi, and Mistral Small need special handling in this environment.

Keeping that code shared makes this notebook about causal profiling rather than model plumbing.

def find_project_root(start=None):
    start = Path(start or Path.cwd()).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / 'pyproject.toml').exists() and (candidate / 'notebooks').exists():
            return candidate
    return start

PROJECT_ROOT = find_project_root()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from notebooks._shared.local_llm import (
    clean_generated_text,
    clear_loaded_model_cache,
    get_device,
    local_chat as _shared_local_chat,
)
from notebooks._shared.structured_outputs import parse_pydantic_output

DEVICE = get_device()


def local_chat(user_message, system_message=None, model_id=MODEL_ID, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE):
    return _shared_local_chat(
        user_message,
        system_message=system_message,
        model_id=model_id,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        seed=SEED,
        enabled=RUN_LIVE_LOCAL_LLM,
    )

print(f'Local LLM device: {DEVICE}')

Local LLM device: cuda

12. Structured AI Dataset Profile

We will ask the LLM for a structured profile. The schema forces the model to distinguish candidate pre-treatment covariates from post-treatment variables and leakage risks.

This is a teaching choice. Free-form prose can be useful for brainstorming, but it is harder to test. Structured output makes the AI contribution inspectable.

class DatasetProfileReview(BaseModel):
    dataset_grain: str = Field(description='The inferred observational unit and time grain.')
    candidate_treatment: str = Field(description='The treatment or exposure variable.')
    candidate_outcomes: list[str] = Field(description='Primary and guardrail outcomes.')
    pre_treatment_covariates: list[str] = Field(description='Variables that appear eligible as pre-treatment adjustment candidates.')
    post_treatment_or_mediators: list[str] = Field(description='Variables affected by treatment or measured after treatment.')
    leakage_risks: list[str] = Field(description='Variables or dataset patterns that may leak future or outcome information.')
    missingness_risks: list[str] = Field(description='Missingness patterns that could threaten the analysis.')
    unit_integrity_risks: list[str] = Field(description='Duplicate, grain, join, or aggregation risks.')
    identification_questions: list[str] = Field(description='Questions for domain owners before causal estimation.')
    recommended_exclusions_from_adjustment: list[str] = Field(description='Variables that should not be adjusted for in the primary total-effect model.')
    next_checks: list[str] = Field(description='Concrete profiling or design checks to run next.')
    confidence: Literal['low', 'medium', 'high'] = Field(description='Confidence in the profile given the evidence bundle.')

PROFILE_SCALAR_FIELDS = ['dataset_grain', 'candidate_treatment', 'confidence']
PROFILE_LIST_FIELDS = [
    'candidate_outcomes',
    'pre_treatment_covariates',
    'post_treatment_or_mediators',
    'leakage_risks',
    'missingness_risks',
    'unit_integrity_risks',
    'identification_questions',
    'recommended_exclusions_from_adjustment',
    'next_checks',
]
PROFILE_ALIASES = {
    'unit_of_analysis': 'dataset_grain',
    'grain': 'dataset_grain',
    'treatment': 'candidate_treatment',
    'outcomes': 'candidate_outcomes',
    'covariates': 'pre_treatment_covariates',
    'pre_treatment_controls': 'pre_treatment_covariates',
    'mediators': 'post_treatment_or_mediators',
    'post_treatment_variables': 'post_treatment_or_mediators',
    'excluded_controls': 'recommended_exclusions_from_adjustment',
}


def parse_dataset_profile(raw_output):
    result = parse_pydantic_output(
        raw_output,
        DatasetProfileReview,
        scalar_fields=PROFILE_SCALAR_FIELDS,
        list_fields=PROFILE_LIST_FIELDS,
        field_aliases=PROFILE_ALIASES,
        defaults={'confidence': 'medium'},
    )
    return result.parsed, result.json_text, result.notes

SYSTEM_PROFILE_MESSAGE = (
    'You are a careful causal inference data profiler.\n'
    'Rules:\n'
    '- Use only the evidence in the provided profiling bundle.\n'
    '- Return valid JSON only. No markdown. No preamble.\n'
    '- Do not claim that the dataset identifies a causal effect.\n'
    '- Do not invent randomization, instruments, policy rules, or unobserved variables.\n'
    '- Distinguish pre-treatment covariates from post-treatment variables, mediators, outcomes, and leakage risks.\n'
    '- If timing is ambiguous, put the issue in identification_questions or next_checks.\n'
    '- Be concise but specific. Use variable names from the bundle whenever possible.'
)


def build_profile_prompt(bundle):
    schema_hint = {
        'dataset_grain': 'string',
        'candidate_treatment': 'string',
        'candidate_outcomes': ['string'],
        'pre_treatment_covariates': ['string'],
        'post_treatment_or_mediators': ['string'],
        'leakage_risks': ['string'],
        'missingness_risks': ['string'],
        'unit_integrity_risks': ['string'],
        'identification_questions': ['string'],
        'recommended_exclusions_from_adjustment': ['string'],
        'next_checks': ['string'],
        'confidence': 'low | medium | high',
    }
    return textwrap.dedent(
        f'''
        Produce a DatasetProfileReview JSON object using this schema:
        {json.dumps(schema_hint, indent=2)}

        Profiling bundle:
        {json.dumps(bundle, indent=2)}
        '''
    ).strip()

profile_prompt = build_profile_prompt(profile_bundle)
print(profile_prompt[:3000])

Produce a DatasetProfileReview JSON object using this schema:
        {
  "dataset_grain": "string",
  "candidate_treatment": "string",
  "candidate_outcomes": [
    "string"
  ],
  "pre_treatment_covariates": [
    "string"
  ],
  "post_treatment_or_mediators": [
    "string"
  ],
  "leakage_risks": [
    "string"
  ],
  "missingness_risks": [
    "string"
  ],
  "unit_integrity_risks": [
    "string"
  ],
  "identification_questions": [
    "string"
  ],
  "recommended_exclusions_from_adjustment": [
    "string"
  ],
  "next_checks": [
    "string"
  ],
  "confidence": "low | medium | high"
}

        Profiling bundle:
        {
  "business_question": "Did enabling the AI support assistant reduce human-handled workload without harming customer experience?",
  "intended_estimand": "Total effect of assistant enablement on weekly human-handled workload at the team-week level.",
  "intended_row_grain": [
    "team_id",
    "week_start"
  ],
  "treatment": "assistant_enabled",
  "primary_outcome": "human_handled_hours",
  "guardrail_outcome": "customer_satisfaction_score",
  "schema_profile": [
    {
      "metric": "rows",
      "value": 1268
    },
    {
      "metric": "columns",
      "value": 16
    },
    {
      "metric": "unique team-weeks",
      "value": 1260
    },
    {
      "metric": "duplicated rows by intended grain",
      "value": 16
    },
    {
      "metric": "first week",
      "value": "2025-01-06T00:00:00.000"
    },
    {
      "metric": "last week",
      "value": "2025-04-07T00:00:00.000"
    },
    {
      "metric": "treated rows",
      "value": 463
    },
    {
      "metric": "untreated rows",
      "value": 805
    }
  ],
  "data_dictionary": [
    {
      "variable": "team_id",
      "description": "Support team identifier.",
      "timing": "unit id",
      "initial_role": "unit identifier"
    },
    {
      "variable": "week_start",
      "description": "Start date for the reporting week.",
      "timing": "time id",
      "initial_role": "time identifier"
    },
    {
      "variable": "region",
      "description": "Operating region for the team.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "queue_type",
      "description": "Primary queue served by the team.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "assistant_enabled",
      "description": "Whether the AI assistant was enabled for the team during the week.",
      "timing": "treatment week",
      "initial_role": "treatment"
    },
    {
      "variable": "baseline_ticket_volume",
      "description": "Average weekly ticket volume before rollout planning.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder"
    },
    {
      "variable": "baseline_satisfaction",
      "description": "Pre-rollout customer satisfaction score.",
      "timing": "pre-treatment",
      "initial_role": "candidate confounder

13. Run One Local Model

We start with the fast local model. The goal is not to worship one answer. The goal is to see whether the model can transform deterministic profiling evidence into a useful structured review.

if RUN_LIVE_LOCAL_LLM:
    raw_profile_output = local_chat(
        profile_prompt,
        system_message=SYSTEM_PROFILE_MESSAGE,
        model_id=MODEL_ID,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
    )
    display(Markdown('### Raw model output'))
    display(Markdown(f'```json\n{clean_generated_text(raw_profile_output)}\n```'))
else:
    raw_profile_output = ''
    print('RUN_LIVE_LOCAL_LLM is False. Skipping live local LLM call in the rendered notebook.')

Raw model output

{
 "dataset_grain": "team_id, week_start",
 "candidate_treatment": "assistant_enabled",
 "candidate_outcomes": [
 "human_handled_hours"
 ],
 "pre_treatment_covariates": [
 "region",
 "queue_type",
 "baseline_ticket_volume",
 "baseline_satisfaction",
 "team_readiness",
 "manager_priority",
 "backlog_start",
 "staffing_hours"
 ],
 "post_treatment_or_mediators": [
 "agent_uses_ai",
 "deflection_rate"
 ],
 "leakage_risks": [
 "next_week_backlog"
 ],
 "missingness_risks": [
 "deflection_rate",
 "agent_uses_ai",
 "customer_satisfaction_score"
 ],
 "unit_integrity_risks": [
 "duplicated rows by intended grain"
 ],
 "identification_questions": [
 "Is there a valid comparison group for teams that did not have the AI assistant enabled?",
 "Are there any time-varying confounders that were not captured?",
 "Does the treatment assignment appear to be random or quasi-random?"
 ],
 "recommended_exclusions_from_adjustment": [
 "agent_uses_ai",
 "deflection_rate",
 "human_handled_hours",
 "customer_satisfaction_score",
 "next_week_backlog"
 ],
 "next_checks": [
 "Verify the causal relationships between the treatment and the mediators and outcomes.",
 "Check for any omitted variable bias due to unmeasured confounders.",
 "Ensure that the treatment was applied consistently across all teams."
 ],
 "confidence": "medium"
}

if raw_profile_output:
    try:
        profile_review, profile_review_json, profile_parser_notes = parse_dataset_profile(raw_profile_output)
        single_model_parse_error = ''
        display(Markdown('### Parsed profile review'))
        display(Markdown(f'```json\n{profile_review.model_dump_json(indent=2)}\n```'))
        print('Parser notes:', profile_parser_notes)
    except Exception as error:
        profile_review = None
        profile_review_json = ''
        profile_parser_notes = []
        single_model_parse_error = clean_generated_text(repr(error))
        print('The single-model output could not be parsed. This is a useful brittleness signal, not a reason to trust the output silently.')
        print(single_model_parse_error[:1200])
else:
    profile_review = None
    profile_review_json = ''
    profile_parser_notes = []
    single_model_parse_error = ''

Parsed profile review

{
  "dataset_grain": "team_id, week_start",
  "candidate_treatment": "assistant_enabled",
  "candidate_outcomes": [
    "human_handled_hours"
  ],
  "pre_treatment_covariates": [
    "region",
    "queue_type",
    "baseline_ticket_volume",
    "baseline_satisfaction",
    "team_readiness",
    "manager_priority",
    "backlog_start",
    "staffing_hours"
  ],
  "post_treatment_or_mediators": [
    "agent_uses_ai",
    "deflection_rate"
  ],
  "leakage_risks": [
    "next_week_backlog"
  ],
  "missingness_risks": [
    "deflection_rate",
    "agent_uses_ai",
    "customer_satisfaction_score"
  ],
  "unit_integrity_risks": [
    "duplicated rows by intended grain"
  ],
  "identification_questions": [
    "Is there a valid comparison group for teams that did not have the AI assistant enabled?",
    "Are there any time-varying confounders that were not captured?",
    "Does the treatment assignment appear to be random or quasi-random?"
  ],
  "recommended_exclusions_from_adjustment": [
    "agent_uses_ai",
    "deflection_rate",
    "human_handled_hours",
    "customer_satisfaction_score",
    "next_week_backlog"
  ],
  "next_checks": [
    "Verify the causal relationships between the treatment and the mediators and outcomes.",
    "Check for any omitted variable bias due to unmeasured confounders.",
    "Ensure that the treatment was applied consistently across all teams."
  ],
  "confidence": "medium"
}

Parser notes: []

Discussion

A strong profile should do five things:

Name the row grain as team-week.
Name assistant_enabled as the treatment.
Separate outcomes from controls.
Flag agent_uses_ai, deflection_rate, and next_week_backlog as risky for the primary total-effect adjustment set.
Ask domain questions rather than pretending profiling has solved identification.

Notice that this is a different use of AI than asking, “What is the causal effect?” We are asking for an audit-oriented profile, not an estimate.

14. Audit the AI Profile

Now we grade the AI output against deterministic expectations. This is the part that makes the workflow robust. The LLM can be unstable; the audit rules remain stable.

These rules are intentionally simple and transparent. In a production workflow, you would add more checks and preserve the raw output, prompt, model ID, package versions, and data snapshot hash.

def text_blob_from_profile(review):
    if review is None:
        return ''
    return clean_generated_text(json.dumps(review.model_dump(), sort_keys=True)).lower()


def contains_any(text, terms):
    return any(term.lower() in text for term in terms)


def score_dataset_profile(review):
    text = text_blob_from_profile(review)
    checks = {
        'names team-week grain': contains_any(text, ['team-week', 'team week', 'team_id', 'week_start']),
        'names assistant_enabled treatment': 'assistant_enabled' in text,
        'names primary outcome': 'human_handled_hours' in text,
        'names guardrail outcome': 'customer_satisfaction_score' in text,
        'includes pre-treatment covariates': sum(term in text for term in ['baseline_ticket_volume', 'team_readiness', 'manager_priority', 'backlog_start', 'staffing_hours']) >= 3,
        'flags post-treatment variables': sum(term in text for term in ['agent_uses_ai', 'deflection_rate']) >= 1,
        'flags future leakage': 'next_week_backlog' in text,
        'flags missingness': contains_any(text, ['missing', 'missingness', 'baseline_ticket_volume', 'customer_satisfaction_score']),
        'flags duplicate/unit risk': contains_any(text, ['duplicate', 'duplicated', 'unit integrity', 'grain']),
        'does not claim identification': not contains_any(text, ['identifies the causal effect', 'proves the causal effect', 'establishes causality']),
        'asks domain/design questions': len(review.identification_questions) >= 2 if review is not None else False,
    }
    rows = [{'check': key, 'passed': bool(value)} for key, value in checks.items()]
    audit = pd.DataFrame(rows)
    audit['credit'] = audit['passed'].astype(int)
    score = audit['credit'].sum()
    return audit, score, len(audit)

if profile_review is not None:
    single_audit, single_score, single_max_score = score_dataset_profile(profile_review)
    print(f'Score: {single_score}/{single_max_score}')
    display(single_audit)
else:
    print('No profile review to audit because live model execution was skipped.')

Score: 11/11

	check	passed	credit
0	names team-week grain	True	1
1	names assistant_enabled treatment	True	1
2	names primary outcome	True	1
3	names guardrail outcome	True	1
4	includes pre-treatment covariates	True	1
5	flags post-treatment variables	True	1
6	flags future leakage	True	1
7	flags missingness	True	1
8	flags duplicate/unit risk	True	1
9	does not claim identification	True	1
10	asks domain/design questions	True	1

Discussion

The audit table is not a perfect judge of causal reasoning. It is a guardrail. It catches common problems:

Treating predictive post-treatment variables as ordinary controls.
Ignoring missingness or duplicate rows.
Forgetting that profiling cannot establish identification.
Producing a nice-looking but vague review with no concrete variable names.

This is the same pattern we will use repeatedly in the course: deterministic evidence, AI synthesis, structured parse, rule-based audit, human interpretation.

15. Optional All-Model Comparison

We now run the same profiling task across all locally available models. This section can take a while, especially for the larger 24B, 27B, and 32B models.

The exact ranking can change across reruns and environments. That is expected. The durable lesson is not which model wins today. The durable lesson is that dataset-profiling workflows need model comparison, failure metadata, and zero-credit handling for invalid outputs instead of quiet NaN scores.

PROFILE_EVAL_CASES = [
    {
        'case_id': 'support_assistant_profile',
        'bundle': profile_bundle,
    }
]

SUMMARY_COLUMNS = [
    'label', 'model_id', 'role', 'cases', 'schema_valid_cases', 'schema_repaired_cases',
    'schema_reliability', 'mean_profile_score', 'failure_types'
]
CASE_RESULT_COLUMNS = [
    'label', 'model_id', 'role', 'case_id', 'status', 'schema_valid', 'repair_used',
    'repair_stage', 'error_type', 'profile_score', 'max_profile_score', 'profile_score_share',
    'error', 'raw_output_preview'
]

SCHEMA_REPAIR_PROMPT_TEMPLATE = textwrap.dedent(
    '''
    Your previous answer could not be parsed as the required DatasetProfileReview JSON schema.
    Convert the previous answer into valid JSON only. Do not add new causal claims.

    Parser error:
    {error_message}

    Previous answer:
    {raw_output}
    '''
).strip()


def classify_structured_output_failure(error):
    text = clean_generated_text(repr(error)).lower()
    if 'empty model output' in text:
        return 'empty_output'
    if 'field required' in text or 'missing' in text:
        return 'missing_required_field'
    if 'input should be' in text or 'validation error' in text:
        return 'wrong_field_type_or_schema'
    if 'invalid json' in text or 'expecting value' in text or 'eof' in text or 'jsondecodeerror' in text:
        return 'invalid_json_or_truncated_output'
    return 'other_structured_output_error'


def empty_profile_review():
    return DatasetProfileReview(
        dataset_grain='',
        candidate_treatment='',
        candidate_outcomes=[],
        pre_treatment_covariates=[],
        post_treatment_or_mediators=[],
        leakage_risks=[],
        missingness_risks=[],
        unit_integrity_risks=[],
        identification_questions=[],
        recommended_exclusions_from_adjustment=[],
        next_checks=[],
        confidence='low',
    )


def parse_or_repair_profile(raw_output, model_id):
    if not clean_generated_text(raw_output):
        raise ValueError('empty model output')
    try:
        parsed, parsed_json, notes = parse_dataset_profile(raw_output)
        return {
            'parsed': parsed,
            'parsed_json': parsed_json,
            'parser_notes': notes,
            'repair_used': bool(notes),
            'repair_stage': 'parser' if notes else 'none',
            'repaired_raw_output': '',
        }
    except Exception as first_error:
        if not (RUN_SCHEMA_REPAIR_RETRY and RUN_LIVE_LOCAL_LLM):
            raise
        repair_prompt = SCHEMA_REPAIR_PROMPT_TEMPLATE.format(
            raw_output=clean_generated_text(raw_output)[:7000],
            error_message=clean_generated_text(repr(first_error))[:1200],
        )
        repaired_raw_output = local_chat(
            repair_prompt,
            system_message=SYSTEM_PROFILE_MESSAGE,
            model_id=model_id,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
        )
        parsed, parsed_json, notes = parse_dataset_profile(repaired_raw_output)
        return {
            'parsed': parsed,
            'parsed_json': parsed_json,
            'parser_notes': [f'first_parse_error: {classify_structured_output_failure(first_error)}'] + notes,
            'repair_used': True,
            'repair_stage': 'model_retry',
            'repaired_raw_output': repaired_raw_output,
        }


def run_single_model_profile_case(label, model_id, role, case):
    prompt = build_profile_prompt(case['bundle'])
    raw_output = ''
    max_score = score_dataset_profile(empty_profile_review())[2]
    try:
        raw_output = local_chat(
            prompt,
            system_message=SYSTEM_PROFILE_MESSAGE,
            model_id=model_id,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
        )
        parsed_result = parse_or_repair_profile(raw_output, model_id)
        audit, score, max_score = score_dataset_profile(parsed_result['parsed'])
        return {
            'label': label,
            'model_id': model_id,
            'role': role,
            'case_id': case['case_id'],
            'status': 'ok',
            'schema_valid': True,
            'repair_used': parsed_result['repair_used'],
            'repair_stage': parsed_result['repair_stage'],
            'error_type': '',
            'profile_score': score,
            'max_profile_score': max_score,
            'profile_score_share': score / max_score if max_score else 0.0,
            'error': '',
            'raw_output_preview': clean_generated_text(raw_output)[:500],
        }
    except Exception as error:
        return {
            'label': label,
            'model_id': model_id,
            'role': role,
            'case_id': case['case_id'],
            'status': 'failed',
            'schema_valid': False,
            'repair_used': False,
            'repair_stage': 'failed',
            'error_type': classify_structured_output_failure(error),
            'profile_score': 0,
            'max_profile_score': max_score,
            'profile_score_share': 0.0,
            'error': clean_generated_text(repr(error))[:900],
            'raw_output_preview': clean_generated_text(raw_output)[:500],
        }


def summarize_model_results(case_results):
    if case_results.empty:
        return pd.DataFrame(columns=SUMMARY_COLUMNS)
    summary = (
        case_results
        .groupby(['label', 'model_id', 'role'], as_index=False)
        .agg(
            cases=('case_id', 'count'),
            schema_valid_cases=('schema_valid', 'sum'),
            schema_repaired_cases=('repair_used', 'sum'),
            mean_profile_score=('profile_score_share', 'mean'),
            failure_types=('error_type', lambda values: sorted({value for value in values if value})),
        )
    )
    summary['schema_reliability'] = summary['schema_valid_cases'] / summary['cases']
    return summary[SUMMARY_COLUMNS].sort_values(['mean_profile_score', 'schema_reliability'], ascending=False)


def run_all_model_profile_comparison(models_to_compare=MODELS_TO_COMPARE, cases=PROFILE_EVAL_CASES):
    rows = []
    for label, model_id, role in models_to_compare:
        print(f'Running {label}: {model_id}')
        for case in cases[:MODEL_COMPARISON_CASE_LIMIT]:
            rows.append(run_single_model_profile_case(label, model_id, role, case))
        clear_loaded_model_cache()
    case_results = pd.DataFrame(rows, columns=CASE_RESULT_COLUMNS)
    summary = summarize_model_results(case_results)
    return summary, case_results

if RUN_FULL_MODEL_COMPARISON and RUN_LIVE_LOCAL_LLM:
    profile_model_summary, profile_case_results = run_all_model_profile_comparison()
else:
    profile_model_summary = pd.DataFrame(columns=SUMMARY_COLUMNS)
    profile_case_results = pd.DataFrame(columns=CASE_RESULT_COLUMNS)
    print('Full model comparison skipped. Set RUN_FULL_MODEL_COMPARISON and RUN_LIVE_LOCAL_LLM to True to run it.')

profile_model_summary

Running Qwen 0.5B: Qwen/Qwen2.5-0.5B-Instruct
Running Qwen 7B: Qwen/Qwen2.5-7B-Instruct
Running Qwen 14B: Qwen/Qwen2.5-14B-Instruct
Running Qwen 32B: Qwen/Qwen2.5-32B-Instruct
Running Phi mini: microsoft/Phi-3.5-mini-instruct

Running Mistral 7B: mistralai/Mistral-7B-Instruct-v0.3
Running Mistral Small 24B: mistralai/Mistral-Small-3.1-24B-Instruct-2503

Running Gemma 3 27B: google/gemma-3-27b-it
Running Llama 3.1 8B: meta-llama/Meta-Llama-3.1-8B-Instruct

	label	model_id	role	cases	schema_valid_cases	schema_repaired_cases	schema_reliability	mean_profile_score	failure_types
0	Gemma 3 27B	google/gemma-3-27b-it	large non-Qwen comparison	1	1	1	1.0	1.000000	[]
1	Llama 3.1 8B	meta-llama/Meta-Llama-3.1-8B-Instruct	industry-standard instruct baseline	1	1	0	1.0	1.000000	[]
2	Mistral 7B	mistralai/Mistral-7B-Instruct-v0.3	7B model-family comparison	1	1	0	1.0	1.000000	[]
3	Mistral Small 24B	mistralai/Mistral-Small-3.1-24B-Instruct-2503	strong non-Qwen comparison	1	1	1	1.0	1.000000	[]
7	Qwen 32B	Qwen/Qwen2.5-32B-Instruct	scale comparison	1	1	0	1.0	1.000000	[]
8	Qwen 7B	Qwen/Qwen2.5-7B-Instruct	fast default	1	1	0	1.0	1.000000	[]
4	Phi mini	microsoft/Phi-3.5-mini-instruct	compact non-Qwen comparison	1	1	1	1.0	0.909091	[]
5	Qwen 0.5B	Qwen/Qwen2.5-0.5B-Instruct	pipeline smoke test	1	1	0	1.0	0.909091	[]
6	Qwen 14B	Qwen/Qwen2.5-14B-Instruct	strong local analysis	1	1	0	1.0	0.909091	[]

Inspecting Failed Model Runs

A failed model run receives zero workflow credit instead of producing a NaN score. This is important because invalid JSON, empty outputs, tokenization artifacts, and truncated responses are part of the operating reality of local LLM workflows.

The table below preserves failure metadata so we can distinguish model reasoning problems from infrastructure or schema problems.

failed_model_details = profile_case_results.loc[
    ~profile_case_results['schema_valid'],
    ['label', 'status', 'error_type', 'error', 'raw_output_preview'],
].reset_index(drop=True)
failed_model_details

	label	status	error_type	error	raw_output_preview

Interpreting Repair Counts

Repairs are not automatically failures. They tell us how much cleanup was needed before the answer became usable. A model that needs frequent schema repair may still contain useful causal reasoning, but it is more expensive and risky to automate.

def summarize_repair_stages(case_results):
    if case_results.empty or 'repair_stage' not in case_results.columns:
        return pd.DataFrame(columns=['repair_stage', 'cases'])
    return (
        case_results.assign(repair_stage=case_results['repair_stage'].fillna('none'))
        .groupby('repair_stage', as_index=False)
        .agg(cases=('case_id', 'count'))
        .sort_values('cases', ascending=False)
    )

summarize_repair_stages(profile_case_results)

	repair_stage	cases
0	none	6
1	parser	3

16. Turning the Profile into a Human Review Checklist

The output of profiling should not be a final answer. It should be a checklist for the analyst and domain owner.

Below is a compact checklist template you can reuse before running a causal estimator.

causal_dataset_checklist = pd.DataFrame(
    [
        {'check': 'Row grain is verified', 'evidence in this notebook': 'team_id + week_start duplicate check', 'owner': 'analyst + data engineer'},
        {'check': 'Treatment timing is documented', 'evidence in this notebook': 'assistant_enabled and data dictionary timing', 'owner': 'domain owner'},
        {'check': 'Outcome window is after treatment', 'evidence in this notebook': 'human_handled_hours and customer_satisfaction_score timing', 'owner': 'analyst'},
        {'check': 'Pre-treatment covariates are separated from mediators', 'evidence in this notebook': 'role_scan and data dictionary', 'owner': 'analyst'},
        {'check': 'Post-treatment and future variables are excluded from total-effect adjustment', 'evidence in this notebook': 'agent_uses_ai, deflection_rate, next_week_backlog scan', 'owner': 'analyst'},
        {'check': 'Missingness mechanism is investigated', 'evidence in this notebook': 'missingness by treatment and timing', 'owner': 'analyst + data owner'},
        {'check': 'Treatment imbalance is profiled', 'evidence in this notebook': 'standardized mean differences', 'owner': 'analyst'},
        {'check': 'AI review is audited', 'evidence in this notebook': 'structured output score and failure metadata', 'owner': 'analyst'},
    ]
)
causal_dataset_checklist

	check	evidence in this notebook	owner
0	Row grain is verified	team_id + week_start duplicate check	analyst + data engineer
1	Treatment timing is documented	assistant_enabled and data dictionary timing	domain owner
2	Outcome window is after treatment	human_handled_hours and customer_satisfaction_score timing	analyst
3	Pre-treatment covariates are separated from mediators	role_scan and data dictionary	analyst
4	Post-treatment and future variables are excluded from total-effect adjustment	agent_uses_ai, deflection_rate, next_week_backlog scan	analyst
5	Missingness mechanism is investigated	missingness by treatment and timing	analyst + data owner
6	Treatment imbalance is profiled	standardized mean differences	analyst
7	AI review is audited	structured output score and failure metadata	analyst

17. Exercises

Add a new variable called same_week_resolution_rate. Decide whether it is an outcome, mediator, control, or leakage risk for the total effect of assistant enablement.
Change the missingness mechanism so customer_satisfaction_score is missing more often for treated teams. Rerun the missingness profile and describe what changes.
Modify the evidence bundle to omit timing metadata. Rerun one local model and compare the profile quality.
Add a second model-comparison case for a different estimand: the effect of actual assistant usage intensity among enabled teams. Which variables change role?
Write a stricter scoring rule that penalizes any recommendation to adjust for agent_uses_ai, deflection_rate, or next_week_backlog in the primary total-effect model.

Key Takeaways

Dataset profiling is not preliminary housekeeping. It is part of the causal design.

A strong AI-assisted profiling workflow has this shape:

Deterministic code measures the dataset.
Timing metadata maps columns to causal roles.
The LLM receives a compact evidence bundle, not an invitation to invent context.
Structured outputs make the model response testable.
Audits turn model brittleness into visible workflow evidence.
The final decision remains with the analyst and domain owner.

The practical habit is simple: let Python count, let the LLM organize and question, and let causal design decide what is admissible.