01. AI-Assisted Causal Workflow

Modern AI systems can make a causal analyst faster, more organized, and more consistent. They can draft estimand cards, brainstorm candidate DAGs, summarize domain documents, generate diagnostic code, and turn estimates into stakeholder-ready memos.

They cannot, by themselves, make a causal claim credible.

This notebook introduces a disciplined workflow for using AI as a causal-analysis copilot. The main idea is simple: use AI to produce artifacts that a trained analyst can inspect, test, revise, and document. Do not use AI as a substitute for identification logic, domain knowledge, or statistical diagnostics.

Because this course can use local models up to roughly 32B parameters, we will also treat model scale as something to evaluate. Larger models may produce better causal artifacts, but scale does not remove the need for estimands, DAGs, diagnostics, and human review.

We will also compare model families selectively. A clean Qwen ladder is useful for scale comparisons, while Phi, Mistral, Gemma, and Llama models help us see whether different pretrained and instruction-tuned systems make different causal mistakes.

Learning Goals

By the end of this notebook, you should be able to:

Explain where AI can help in a causal inference project and where it can mislead.
Convert a vague business question into a structured causal estimand card.
Use AI-style structured outputs without giving up analyst control.
Build and critique a candidate DAG for an applied project.
Run a lightweight data audit that checks treatment imbalance, timing, and bad-control risks.
Compare a naive estimate with adjusted causal estimates in a simulated observational setting.
Produce a decision memo that separates evidence, assumptions, limitations, and recommendations.
Explain why model scale and model family can change AI-assisted causal work without making causal validation optional.

Live Model Note

This course treats LLM behavior as an empirical object. These notebooks may include live local-model calls, so outputs can vary across model versions, hardware, decoding settings, prompt wording, package versions, and reruns. That instability is part of the lesson: AI-assisted causal inference requires validation, audit trails, and analyst judgment.

Treat model output as a draft artifact, not as causal evidence. A model may produce valid JSON with weak causal reasoning, or strong prose that fails schema validation.

When live calls are enabled, read the results as experiments about AI behavior:

Did the model invent design details?
Did it confuse prediction with causation?
Did it recommend bad controls?
Did it obey the schema?
Did it surface missing information?
Did it preserve uncertainty?

The goal is not to make every model output perfect. The goal is to learn how to build AI-assisted causal workflows that are auditable, constrained, and reviewed by a human analyst.

1. What AI Assistance Means in Causal Work

A useful way to think about AI in causal inference is to separate artifact generation from causal authority.

AI is useful for producing first drafts of:

causal question decompositions,
estimand cards,
candidate DAGs,
variable-role tables,
data-audit checklists,
method shortlists,
analysis code skeletons,
diagnostic narratives,
executive memos.

The analyst remains responsible for:

verifying domain assumptions,
deciding which variables are pre-treatment,
rejecting bad controls,
checking overlap and measurement quality,
choosing an identification strategy,
interpreting uncertainty,
communicating limitations honestly.

That boundary matters. AI can make a wrong causal analysis look polished. A polished wrong answer is still wrong.

2. The Workflow We Will Use

Phase	AI can help with	Analyst must own	Main artifact
Question framing	Translate vague stakeholder language into treatment, outcome, unit, and time window	Decide whether the causal question is meaningful and actionable	Estimand card
Assumptions	Brainstorm mechanisms, confounders, mediators, colliders, and selection processes	Validate the DAG against domain knowledge	DAG and adjustment set
Data audit	Draft checks for missingness, timing, balance, leakage, and overlap	Decide whether the available data can support the design	Data-readiness report
Method choice	Shortlist feasible designs and state assumptions	Choose the design and reject invalid options	Design memo
Estimation	Generate reproducible code skeletons	Inspect diagnostics and decide if estimates are credible	Notebook and result table
Reporting	Draft concise narratives for decision makers	Prevent overclaiming and document uncertainty	Decision memo

The rest of the notebook walks through these phases using a simulated industry example.

2.1. Model Scale and Model Family as Design Variables

In this course, the local LLM is not just infrastructure. It is also part of the analysis workflow we can study.

A useful model ladder is:

Model tier	Example model	Role
Smoke test	`Qwen/Qwen2.5-0.5B-Instruct`	Prove the local generation pipeline works
Fast default	`Qwen/Qwen2.5-7B-Instruct`	Main working model for quick drafts of causal artifacts
Strong local model	`Qwen/Qwen2.5-14B-Instruct`	DAG critique, estimand cards, memos, and method review
Scale comparison	`Qwen/Qwen2.5-32B-Instruct`	Study how larger models change causal reasoning quality
Model-family comparison	`microsoft/Phi-3.5-mini-instruct`	Compact non-Qwen comparison
Model-family comparison	`mistralai/Mistral-7B-Instruct-v0.3`	Alternative 7B instruct model
Strong non-Qwen comparison	`mistralai/Mistral-Small-3.1-24B-Instruct-2503`	Larger Mistral comparison below the 32B ceiling
Large non-Qwen comparison	`google/gemma-3-27b-it`	Strong Gemma-family comparison point
Industry-standard baseline	`meta-llama/Meta-Llama-3.1-8B-Instruct`	Common local instruct-model baseline

For now, treat Qwen/Qwen2.5-7B-Instruct as the practical working default and treat the larger and alternative-family models as comparison points. Notebook 02 will benchmark this ladder directly and evaluate whether scale or model family changes causal artifact quality. The result should be treated as evidence from a task-specific rubric, not as a universal ranking of models.

The point is not to assume that 32B is automatically correct or that one model family is universally best. The point is to run the same causal task across model scales and model families, then evaluate the outputs with causal criteria:

Did the model state a clear treatment, outcome, unit, and time window?
Did it distinguish association from causation?
Did it identify plausible confounders?
Did it avoid controlling for post-treatment variables?
Did it state assumptions that are required but untestable?
Did it avoid inventing facts not present in the prompt?
Did it communicate uncertainty without weakening the decision relevance?

This framing turns local LLM scale and model-family choice into teachable objects. Bigger models may be more fluent and more complete, and different model families may have different habits, but causal validity still comes from the workflow.

3. Setup

The notebook uses standard scientific Python packages plus pydantic for structured artifacts and graphviz for DAGs. No live LLM API call is required. Where an AI assistant would normally generate text or JSON, we use a simulated assistant output so the notebook remains reproducible.

import warnings
from textwrap import dedent

import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from IPython.display import Markdown, display
from pydantic import BaseModel, Field, ValidationError
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

warnings.filterwarnings('ignore')

sns.set_theme(style='whitegrid', context='notebook')
pd.set_option('display.float_format', '{:.3f}'.format)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def make_dag(edges, title=None, node_colors=None, rankdir='LR'):
    graph = graphviz.Digraph(format='svg')
    graph.attr(rankdir=rankdir, bgcolor='white', pad='0.2')
    graph.attr('node', shape='box', style='rounded,filled', color='#334155', fontname='Helvetica')
    graph.attr('edge', color='#475569', arrowsize='0.8')

    node_colors = node_colors or {}
    nodes = sorted({node for edge in edges for node in edge[:2]})
    for node in nodes:
        graph.node(node, fillcolor=node_colors.get(node, '#eef2ff'))

    for edge in edges:
        if len(edge) == 2:
            graph.edge(edge[0], edge[1])
        else:
            graph.edge(edge[0], edge[1], label=edge[2])

    if title:
        graph.attr(label=title, labelloc='t', fontsize='18', fontname='Helvetica-Bold')
    return graph


def risk_difference(frame, outcome, treatment):
    treated = frame.loc[frame[treatment] == 1, outcome]
    control = frame.loc[frame[treatment] == 0, outcome]
    estimate = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
    return {
        'estimate': estimate,
        'std_error': se,
        'ci_low': estimate - 1.96 * se,
        'ci_high': estimate + 1.96 * se,
        'treated_mean': treated.mean(),
        'control_mean': control.mean(),
        'n_treated': treated.size,
        'n_control': control.size,
    }


def standardized_mean_difference(frame, variable, treatment='outreach'):
    treated = frame.loc[frame[treatment] == 1, variable]
    control = frame.loc[frame[treatment] == 0, variable]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd


def plot_estimates(table, title, true_value=None):
    fig, ax = plt.subplots(figsize=(9, 4.5))
    y = np.arange(len(table))
    ax.errorbar(
        table['estimate'],
        y,
        xerr=[table['estimate'] - table['ci_low'], table['ci_high'] - table['estimate']],
        fmt='o',
        color='#2563eb',
        ecolor='#93c5fd',
        capsize=4,
    )
    if true_value is not None:
        ax.axvline(true_value, color='#16a34a', linestyle='--', label='true ATE in simulation')
    ax.axvline(0, color='#64748b', linewidth=1)
    ax.set_yticks(y)
    ax.set_yticklabels(table['method'])
    ax.set_xlabel('Effect on churn risk')
    ax.set_title(title)
    if true_value is not None:
        ax.legend(loc='best')
    plt.tight_layout()
    return fig, ax

4. Running Example: Proactive Retention Outreach

Imagine a SaaS company asks the data science team:

Did proactive customer-success outreach reduce churn last quarter?

This sounds simple, but it is not yet a causal question. We need to clarify:

What exactly counts as outreach?
Which customers were eligible?
What is the outcome and over what time window?
Was outreach randomly assigned?
Did customer-success managers target accounts that already looked risky?
Which variables were measured before outreach?
Which variables were caused by outreach and should not be adjusted for?

We will simulate data where outreach truly reduces churn, but high-risk customers are more likely to receive outreach. That creates a classic observational-evaluation problem: the naive comparison will be badly biased.

rng = np.random.default_rng(42)
n = 6_000

company_size = rng.lognormal(mean=4.2, sigma=0.75, size=n)
tenure_months = np.clip(rng.gamma(shape=2.8, scale=8, size=n), 1, 96)
usage_score = np.clip(rng.normal(loc=0, scale=1, size=n), -3, 3)
prior_tickets = rng.poisson(np.exp(0.35 - 0.35 * usage_score + 0.25 * (company_size > 120)))
health_score = np.clip(
    67 + 9 * usage_score - 3.8 * prior_tickets + 0.06 * tenure_months + rng.normal(0, 8, n),
    1,
    99,
)
region = rng.choice(['North America', 'Europe', 'APAC', 'Latin America'], size=n, p=[0.45, 0.25, 0.22, 0.08])
segment = pd.cut(company_size, bins=[0, 60, 140, np.inf], labels=['small', 'mid-market', 'enterprise'])

region_risk = pd.Series(region).map(
    {'North America': 0.00, 'Europe': -0.08, 'APAC': 0.05, 'Latin America': 0.12}
).to_numpy()

# Historical targeting: riskier and larger accounts were more likely to receive proactive outreach.
propensity = sigmoid(
    -1.15
    + 0.045 * (70 - health_score)
    + 0.19 * prior_tickets
    + 0.45 * (company_size > 120)
    - 0.015 * tenure_months
    + region_risk
)
outreach = rng.binomial(1, propensity)

# Baseline churn risk without outreach.
p_churn_without = sigmoid(
    -1.65
    + 0.055 * (70 - health_score)
    + 0.15 * prior_tickets
    - 0.018 * tenure_months
    + region_risk
)

# True causal effect: outreach reduces churn more for fragile accounts.
true_individual_effect = (
    -0.025
    - 0.055 * (health_score < 55)
    - 0.025 * (prior_tickets >= 4)
    - 0.015 * (company_size > 120)
)
p_churn_with = np.clip(p_churn_without + true_individual_effect, 0.01, 0.95)
churn = rng.binomial(1, np.where(outreach == 1, p_churn_with, p_churn_without))

df = pd.DataFrame(
    {
        'account_id': np.arange(1, n + 1),
        'outreach': outreach,
        'churn': churn,
        'company_size': company_size,
        'tenure_months': tenure_months,
        'usage_score': usage_score,
        'prior_tickets': prior_tickets,
        'health_score': health_score,
        'region': region,
        'segment': segment.astype(str),
        'propensity_true': propensity,
        'true_individual_effect': true_individual_effect,
        'p_churn_without': p_churn_without,
        'p_churn_with': p_churn_with,
    }
)

true_ate = df['true_individual_effect'].mean()
df.head()

	account_id	outreach	churn	company_size	tenure_months	usage_score	prior_tickets	health_score	region	segment	propensity_true	true_individual_effect	p_churn_without	p_churn_with
0	1	0	1	83.809	22.699	-1.180	2	49.995	North America	mid-market	0.448	-0.080	0.341	0.261
1	2	0	0	30.570	32.702	-0.084	3	65.911	North America	small	0.292	-0.025	0.173	0.148
2	3	1	0	117.078	9.620	0.341	2	66.146	Europe	mid-market	0.306	-0.025	0.199	0.174
3	4	1	1	135.020	17.459	0.163	2	63.467	APAC	mid-market	0.441	-0.040	0.222	0.182
4	5	1	1	15.436	5.464	-1.711	3	42.650	APAC	small	0.650	-0.080	0.564	0.484

overview = pd.DataFrame(
    {
        'quantity': [
            'accounts',
            'outreach rate',
            'observed churn rate',
            'true ATE in simulation',
            'mean baseline churn risk among treated',
            'mean baseline churn risk among controls',
        ],
        'value': [
            len(df),
            df['outreach'].mean(),
            df['churn'].mean(),
            true_ate,
            df.loc[df['outreach'] == 1, 'p_churn_without'].mean(),
            df.loc[df['outreach'] == 0, 'p_churn_without'].mean(),
        ],
    }
)
overview

	quantity	value
0	accounts	6000.000
1	outreach rate	0.340
2	observed churn rate	0.222
3	true ATE in simulation	-0.046
4	mean baseline churn risk among treated	0.321
5	mean baseline churn risk among controls	0.189

Discussion

The simulated true average treatment effect is negative: outreach reduces churn risk. However, the treated accounts are much riskier at baseline because the business targeted accounts that already looked vulnerable.

This is exactly the kind of setting where AI can be useful and dangerous at the same time. Useful because it can help us organize a careful causal design. Dangerous because, if prompted carelessly, it may produce a confident narrative from the naive comparison.

5. The Naive AI Answer We Do Not Want

Suppose someone asks an AI assistant to summarize whether outreach worked and gives it only the treated and untreated churn rates. It might produce an answer like this:

Outreach accounts churned more than non-outreach accounts, so the program appears harmful.

That answer is superficially data-driven but causally wrong. It compares high-risk targeted accounts to lower-risk non-targeted accounts.

naive = risk_difference(df, outcome='churn', treatment='outreach')

pd.DataFrame(
    [
        {
            'treated churn': naive['treated_mean'],
            'control churn': naive['control_mean'],
            'naive risk difference': naive['estimate'],
            'ci low': naive['ci_low'],
            'ci high': naive['ci_high'],
            'n treated': naive['n_treated'],
            'n control': naive['n_control'],
        }
    ]
)

	treated churn	control churn	naive risk difference	ci low	ci high	n treated	n control
0	0.273	0.195	0.078	0.056	0.101	2038	3962

With this seed, the naive estimate says outreach is associated with higher churn. But we know the true simulated effect is negative. The sign is wrong because outreach was targeted toward accounts at high risk of churn.

This is the first rule of AI-assisted causal work:

Never let the language model turn an associational comparison into a causal conclusion.

6. AI-Assisted Estimand Card

A better first use of AI is not to ask for the answer. It is to ask for a structured causal design artifact.

Below we define an estimand-card schema. In a live workflow, an LLM could draft JSON matching this schema from a project brief. The analyst would then edit and approve it.

class EstimandCard(BaseModel):
    business_question: str
    causal_question: str
    unit: str
    treatment: str
    control_condition: str
    outcome: str
    time_window: str
    target_population: str
    estimand: str
    required_assumptions: list[str] = Field(default_factory=list)
    likely_threats: list[str] = Field(default_factory=list)
    variables_to_audit: list[str] = Field(default_factory=list)


simulated_ai_output = {
    'business_question': 'Did proactive customer-success outreach reduce churn last quarter?',
    'causal_question': 'What was the average causal effect of proactive outreach on 90-day churn among eligible SaaS accounts?',
    'unit': 'customer account',
    'treatment': 'receiving proactive customer-success outreach during the intervention month',
    'control_condition': 'not receiving proactive outreach during the same eligibility window',
    'outcome': 'churn within 90 days after the outreach eligibility date',
    'time_window': 'accounts eligible during the prior quarter; outcome measured over the following 90 days',
    'target_population': 'eligible active SaaS accounts that could have received proactive outreach',
    'estimand': 'average treatment effect on churn risk for the eligible-account population',
    'required_assumptions': [
        'all common causes of outreach assignment and churn are observed and adjusted for',
        'each account has a positive probability of receiving and not receiving outreach conditional on covariates',
        'pre-treatment covariates are measured before outreach assignment',
        'outreach for one account does not materially affect another account churn outcome',
    ],
    'likely_threats': [
        'customer-success managers targeted accounts using unobserved knowledge',
        'usage changes after outreach could be mistakenly controlled for',
        'enterprise accounts may have different overlap than small accounts',
        'churn timing could overlap with outreach timing for some accounts',
    ],
    'variables_to_audit': [
        'account health score',
        'prior support tickets',
        'usage before outreach',
        'company size',
        'tenure',
        'region',
        'post-outreach engagement variables',
    ],
}

estimand_card = EstimandCard.model_validate(simulated_ai_output)
estimand_card

EstimandCard(business_question='Did proactive customer-success outreach reduce churn last quarter?', causal_question='What was the average causal effect of proactive outreach on 90-day churn among eligible SaaS accounts?', unit='customer account', treatment='receiving proactive customer-success outreach during the intervention month', control_condition='not receiving proactive outreach during the same eligibility window', outcome='churn within 90 days after the outreach eligibility date', time_window='accounts eligible during the prior quarter; outcome measured over the following 90 days', target_population='eligible active SaaS accounts that could have received proactive outreach', estimand='average treatment effect on churn risk for the eligible-account population', required_assumptions=['all common causes of outreach assignment and churn are observed and adjusted for', 'each account has a positive probability of receiving and not receiving outreach conditional on covariates', 'pre-treatment covariates are measured before outreach assignment', 'outreach for one account does not materially affect another account churn outcome'], likely_threats=['customer-success managers targeted accounts using unobserved knowledge', 'usage changes after outreach could be mistakenly controlled for', 'enterprise accounts may have different overlap than small accounts', 'churn timing could overlap with outreach timing for some accounts'], variables_to_audit=['account health score', 'prior support tickets', 'usage before outreach', 'company size', 'tenure', 'region', 'post-outreach engagement variables'])

display(Markdown(
    f'''
### Estimand Card

**Business question:** {estimand_card.business_question}

**Causal question:** {estimand_card.causal_question}

**Unit:** {estimand_card.unit}

**Treatment:** {estimand_card.treatment}

**Control condition:** {estimand_card.control_condition}

**Outcome:** {estimand_card.outcome}

**Target population:** {estimand_card.target_population}

**Estimand:** {estimand_card.estimand}
    '''
))

Estimand Card

Business question: Did proactive customer-success outreach reduce churn last quarter?

Causal question: What was the average causal effect of proactive outreach on 90-day churn among eligible SaaS accounts?

Unit: customer account

Treatment: receiving proactive customer-success outreach during the intervention month

Control condition: not receiving proactive outreach during the same eligibility window

Outcome: churn within 90 days after the outreach eligibility date

Target population: eligible active SaaS accounts that could have received proactive outreach

Estimand: average treatment effect on churn risk for the eligible-account population

Discussion

The value of the estimand card is not that it is final. Its value is that it forces ambiguity into the open.

An analyst should now ask stakeholders:

Were all eligible accounts logged, or only contacted accounts?
Was the eligibility date recorded before outreach?
Did managers choose accounts using notes or informal knowledge not present in the dataset?
Are there accounts that could never receive outreach because of staffing or geography?
Are post-outreach usage metrics being mixed into the covariate table?

AI helped us get to the right conversation faster. It did not settle the causal question.

7. Prompt Pattern: Ask for Artifacts, Not Answers

A weak prompt asks:

Did outreach reduce churn?

A stronger prompt asks:

You are helping draft a causal design document. Do not estimate the answer.
Return a structured estimand card with treatment, outcome, unit, target population,
time window, likely confounders, bad-control risks, required assumptions, and open questions.
Flag anything that cannot be determined from the brief.

This pattern is safer because it channels the model toward design support rather than conclusion generation.

8. AI-Assisted DAG Brainstorming

The next useful artifact is a candidate DAG. A language model can brainstorm plausible relationships from domain text. The analyst must then edit the graph.

For this example, a plausible causal story is:

weak account health, many support tickets, low usage, size, tenure, region, and segment influence whether an account receives outreach;
those same variables also influence churn;
outreach may influence post-outreach engagement, support touchpoints, and eventual churn;
post-outreach engagement is not a pre-treatment confounder.

edges = [
    ('Account health', 'Outreach'),
    ('Account health', 'Churn'),
    ('Prior tickets', 'Outreach'),
    ('Prior tickets', 'Churn'),
    ('Usage before outreach', 'Outreach'),
    ('Usage before outreach', 'Churn'),
    ('Company size', 'Outreach'),
    ('Company size', 'Churn'),
    ('Tenure', 'Outreach'),
    ('Tenure', 'Churn'),
    ('Region', 'Outreach'),
    ('Region', 'Churn'),
    ('Segment', 'Outreach'),
    ('Segment', 'Churn'),
    ('Outreach', 'Post-outreach engagement'),
    ('Post-outreach engagement', 'Churn'),
    ('Outreach', 'Churn'),
]

node_colors = {
    'Outreach': '#fde68a',
    'Churn': '#fecaca',
    'Post-outreach engagement': '#ddd6fe',
}

make_dag(edges, title='Candidate DAG for proactive retention outreach', node_colors=node_colors)

Discussion

The DAG makes one critical warning visible: post-outreach engagement is downstream of the treatment. If we control for it while estimating the total effect of outreach, we block part of the causal pathway and can introduce post-treatment bias.

A good AI assistant should not merely draw a DAG. It should also return a variable-role table and explicitly flag variables that are dangerous to adjust for.

variable_roles = pd.DataFrame(
    [
        ('health_score', 'pre-treatment confounder', 'adjust'),
        ('prior_tickets', 'pre-treatment confounder', 'adjust'),
        ('usage_score', 'pre-treatment confounder', 'adjust'),
        ('company_size', 'pre-treatment confounder or precision variable', 'adjust'),
        ('tenure_months', 'pre-treatment confounder or precision variable', 'adjust'),
        ('region', 'pre-treatment confounder or operational constraint', 'adjust'),
        ('segment', 'pre-treatment confounder or effect modifier', 'adjust and inspect heterogeneity'),
        ('post_outreach_engagement', 'mediator or post-treatment variable', 'do not adjust for total effect'),
        ('followup_calls_after_outreach', 'post-treatment variable', 'do not adjust for total effect'),
        ('manager_notes_after_outreach', 'post-treatment measurement', 'do not adjust for total effect'),
    ],
    columns=['variable', 'role', 'recommendation'],
)

variable_roles

	variable	role	recommendation
0	health_score	pre-treatment confounder	adjust
1	prior_tickets	pre-treatment confounder	adjust
2	usage_score	pre-treatment confounder	adjust
3	company_size	pre-treatment confounder or precision variable	adjust
4	tenure_months	pre-treatment confounder or precision variable	adjust
5	region	pre-treatment confounder or operational constr...	adjust
6	segment	pre-treatment confounder or effect modifier	adjust and inspect heterogeneity
7	post_outreach_engagement	mediator or post-treatment variable	do not adjust for total effect
8	followup_calls_after_outreach	post-treatment variable	do not adjust for total effect
9	manager_notes_after_outreach	post-treatment measurement	do not adjust for total effect

9. AI-Assisted Data Audit

Before fitting models, ask the assistant to draft a data audit. Then make the audit executable.

For this example, the most important questions are:

Are treated and untreated accounts comparable on pre-treatment variables?
Is there enough overlap in treatment propensities?
Are all adjustment variables measured before outreach?
Are any post-treatment variables accidentally included?
Are there missing values or segment-specific data problems?

We will start with balance on pre-treatment covariates.

numeric_covariates = ['company_size', 'tenure_months', 'usage_score', 'prior_tickets', 'health_score']

balance_rows = []
for variable in numeric_covariates:
    balance_rows.append(
        {
            'variable': variable,
            'treated_mean': df.loc[df['outreach'] == 1, variable].mean(),
            'control_mean': df.loc[df['outreach'] == 0, variable].mean(),
            'std_mean_difference': standardized_mean_difference(df, variable),
        }
    )

balance = pd.DataFrame(balance_rows).sort_values('std_mean_difference', key=np.abs, ascending=False)
balance

	variable	treated_mean	control_mean	std_mean_difference
4	health_score	55.368	66.152	-0.779
3	prior_tickets	2.187	1.282	0.636
2	usage_score	-0.304	0.208	-0.528
0	company_size	99.240	82.109	0.217
1	tenure_months	20.827	23.281	-0.185

fig, ax = plt.subplots(figsize=(8, 4.5))
plot_data = balance.sort_values('std_mean_difference')
ax.barh(plot_data['variable'], plot_data['std_mean_difference'], color='#2563eb', alpha=0.8)
ax.axvline(0.1, color='#f97316', linestyle='--', linewidth=1, label='|SMD| = 0.10')
ax.axvline(-0.1, color='#f97316', linestyle='--', linewidth=1)
ax.axvline(0, color='#475569', linewidth=1)
ax.set_title('Pre-treatment imbalance before adjustment')
ax.set_xlabel('Standardized mean difference: treated minus control')
ax.legend(loc='lower right')
plt.tight_layout()

Discussion

The balance table should show meaningful imbalance. Treated accounts tend to have lower health scores, more prior tickets, and larger company size. That is the business targeting rule showing up in the data.

An AI assistant can help describe this imbalance, but the analyst must connect the imbalance to the identification strategy. If the treated accounts were already more likely to churn, a naive treated-versus-control comparison is not credible.

feature_cols = ['company_size', 'tenure_months', 'usage_score', 'prior_tickets', 'health_score', 'region', 'segment']
num_cols = ['company_size', 'tenure_months', 'usage_score', 'prior_tickets', 'health_score']
cat_cols = ['region', 'segment']

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ]
)

propensity_model = Pipeline(
    steps=[
        ('preprocess', preprocess),
        ('model', LogisticRegression(max_iter=1_000)),
    ]
)

propensity_model.fit(df[feature_cols], df['outreach'])
df['propensity_hat'] = np.clip(propensity_model.predict_proba(df[feature_cols])[:, 1], 0.02, 0.98)

fig, ax = plt.subplots(figsize=(8, 4.5))
sns.kdeplot(data=df, x='propensity_hat', hue='outreach', common_norm=False, fill=True, alpha=0.35, ax=ax)
ax.set_title('Estimated propensity overlap')
ax.set_xlabel('Estimated probability of outreach')
plt.tight_layout()

10. AI-Assisted Method Selection

Method selection should be framed as a design question, not a package-selection question.

The assistant can draft a shortlist, but it should include reasons to reject methods. In many real projects, the most valuable AI output is not the method it recommends. It is the invalid methods it helps you rule out.

method_shortlist = pd.DataFrame(
    [
        {
            'method': 'Randomized experiment',
            'could use here?': 'not for historical estimate',
            'why': 'outreach was already assigned non-randomly; useful for a future validation test',
        },
        {
            'method': 'Regression adjustment',
            'could use here?': 'yes, as baseline',
            'why': 'pre-treatment covariates that influenced assignment are available',
        },
        {
            'method': 'Propensity weighting',
            'could use here?': 'yes, with overlap diagnostics',
            'why': 'assignment appears predictable from observed pre-treatment variables',
        },
        {
            'method': 'Doubly robust estimation',
            'could use here?': 'yes, as robustness check',
            'why': 'combines outcome modeling with propensity correction',
        },
        {
            'method': 'Difference-in-differences',
            'could use here?': 'not from current data alone',
            'why': 'would need credible pre/post panel structure and comparison trend evidence',
        },
        {
            'method': 'Regression discontinuity',
            'could use here?': 'not unless assignment used a sharp cutoff',
            'why': 'no documented threshold rule is available',
        },
    ]
)

method_shortlist

	method	could use here?	why
0	Randomized experiment	not for historical estimate	outreach was already assigned non-randomly; us...
1	Regression adjustment	yes, as baseline	pre-treatment covariates that influenced assig...
2	Propensity weighting	yes, with overlap diagnostics	assignment appears predictable from observed p...
3	Doubly robust estimation	yes, as robustness check	combines outcome modeling with propensity corr...
4	Difference-in-differences	not from current data alone	would need credible pre/post panel structure a...
5	Regression discontinuity	not unless assignment used a sharp cutoff	no documented threshold rule is available

11. Estimation: Naive, Adjusted, Weighted, and Doubly Robust

This is not a full lecture on each estimator. The purpose is to show how the workflow changes the analysis.

We will compare:

naive difference in churn rates,
regression adjustment,
inverse propensity weighting,
a simple doubly robust estimator.

Because the data are simulated, we also know the true ATE. In real data, we would not have that luxury.

# 1. Naive comparison
naive_stats = risk_difference(df, outcome='churn', treatment='outreach')

# 2. Regression-adjusted linear probability model
reg_model = smf.ols(
    'churn ~ outreach + company_size + tenure_months + usage_score + prior_tickets + health_score + C(region) + C(segment)',
    data=df,
).fit(cov_type='HC3')

# 3. Inverse propensity weighting for the ATE
e = df['propensity_hat'].to_numpy()
t = df['outreach'].to_numpy()
y = df['churn'].to_numpy()
ipw_estimate = np.mean(t * y / e - (1 - t) * y / (1 - e))

# Approximate standard error by nonparametric bootstrap for the IPW estimate.
boot_rng = np.random.default_rng(123)
boot_ipw = []
for _ in range(300):
    idx = boot_rng.integers(0, len(df), len(df))
    boot_ipw.append(np.mean(t[idx] * y[idx] / e[idx] - (1 - t[idx]) * y[idx] / (1 - e[idx])))
ipw_se = np.std(boot_ipw, ddof=1)

# 4. Simple doubly robust estimator using separate outcome models.
outcome_model_control = Pipeline(
    steps=[('preprocess', preprocess), ('model', LogisticRegression(max_iter=1_000))]
)
outcome_model_treated = Pipeline(
    steps=[('preprocess', preprocess), ('model', LogisticRegression(max_iter=1_000))]
)

outcome_model_control.fit(df.loc[df['outreach'] == 0, feature_cols], df.loc[df['outreach'] == 0, 'churn'])
outcome_model_treated.fit(df.loc[df['outreach'] == 1, feature_cols], df.loc[df['outreach'] == 1, 'churn'])

mu0 = outcome_model_control.predict_proba(df[feature_cols])[:, 1]
mu1 = outcome_model_treated.predict_proba(df[feature_cols])[:, 1]
aipw_scores = mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
aipw_estimate = aipw_scores.mean()
aipw_se = aipw_scores.std(ddof=1) / np.sqrt(len(aipw_scores))

estimate_table = pd.DataFrame(
    [
        {
            'method': 'Naive difference',
            'estimate': naive_stats['estimate'],
            'std_error': naive_stats['std_error'],
            'ci_low': naive_stats['ci_low'],
            'ci_high': naive_stats['ci_high'],
        },
        {
            'method': 'Regression adjustment',
            'estimate': reg_model.params['outreach'],
            'std_error': reg_model.bse['outreach'],
            'ci_low': reg_model.conf_int().loc['outreach', 0],
            'ci_high': reg_model.conf_int().loc['outreach', 1],
        },
        {
            'method': 'IPW',
            'estimate': ipw_estimate,
            'std_error': ipw_se,
            'ci_low': ipw_estimate - 1.96 * ipw_se,
            'ci_high': ipw_estimate + 1.96 * ipw_se,
        },
        {
            'method': 'Doubly robust',
            'estimate': aipw_estimate,
            'std_error': aipw_se,
            'ci_low': aipw_estimate - 1.96 * aipw_se,
            'ci_high': aipw_estimate + 1.96 * aipw_se,
        },
    ]
)

estimate_table

	method	estimate	std_error	ci_low	ci_high
0	Naive difference	0.078	0.012	0.056	0.101
1	Regression adjustment	-0.044	0.012	-0.067	-0.021
2	IPW	-0.045	0.014	-0.072	-0.018
3	Doubly robust	-0.047	0.011	-0.068	-0.026

plot_estimates(
    estimate_table,
    title='Naive comparison is misleading; adjusted estimators recover the beneficial effect',
    true_value=true_ate,
);

Discussion

The naive estimate says outreach increased churn. The adjusted estimators move in the opposite direction and line up closely with the true simulated effect.

This is the point of the workflow:

The estimand card clarified the target quantity.
The DAG clarified which variables were plausible confounders and which were post-treatment.
The balance audit showed that treatment was targeted to high-risk accounts.
The method shortlist justified adjustment and weighting approaches.
The estimates became interpretable only after the design work was explicit.

AI can help produce each artifact. But the credibility comes from the causal design, not from the AI system.

12. Guardrails for AI-Generated Causal Code

If you ask an AI assistant to generate code, require the code to include guardrails. A good instruction is:

Generate estimation code only after restating the estimand and listing the adjustment variables.
Do not include post-treatment variables in the adjustment set.
Include balance diagnostics, overlap diagnostics, and a result table with confidence intervals.
Mark any assumption that cannot be tested from the data.

The goal is to make unsafe code harder to produce silently.

diagnostic_checklist = pd.DataFrame(
    [
        ('Estimand stated before estimation', 'yes', 'ATE on 90-day churn risk among eligible accounts'),
        ('Treatment measured before outcome', 'needs source-system confirmation', 'simulated data satisfy this by construction'),
        ('Adjustment variables are pre-treatment', 'mostly yes', 'post-outreach engagement explicitly excluded'),
        ('Balance checked', 'yes', 'large initial imbalance found'),
        ('Overlap checked', 'yes', 'estimated propensity distribution plotted'),
        ('Multiple estimators compared', 'yes', 'regression, IPW, and doubly robust estimates'),
        ('Unobserved targeting assessed', 'open risk', 'manager notes and informal risk signals may not be observed'),
        ('Decision caveats documented', 'yes', 'memo should distinguish evidence from assumptions'),
    ],
    columns=['check', 'status', 'notes'],
)

diagnostic_checklist

	check	status	notes
0	Estimand stated before estimation	yes	ATE on 90-day churn risk among eligible accounts
1	Treatment measured before outcome	needs source-system confirmation	simulated data satisfy this by construction
2	Adjustment variables are pre-treatment	mostly yes	post-outreach engagement explicitly excluded
3	Balance checked	yes	large initial imbalance found
4	Overlap checked	yes	estimated propensity distribution plotted
5	Multiple estimators compared	yes	regression, IPW, and doubly robust estimates
6	Unobserved targeting assessed	open risk	manager notes and informal risk signals may no...
7	Decision caveats documented	yes	memo should distinguish evidence from assumptions

13. AI-Assisted Decision Memo

Finally, AI can help draft a decision memo. The memo should not hide assumptions. It should make them prominent.

Below is a simple deterministic memo generator. In a live workflow, an LLM could draft prose from the same structured inputs, but the analyst should review every claim.

best_row = estimate_table.loc[estimate_table['method'] == 'Doubly robust'].iloc[0]

memo = f'''
### Draft Decision Memo

**Question.** Did proactive customer-success outreach reduce 90-day churn among eligible accounts?

**Estimand.** Average treatment effect of outreach on churn risk for eligible accounts in the analysis window.

**Design.** Historical outreach was not randomized. Outreach was targeted toward riskier accounts, so the naive comparison is not credible. We used pre-treatment account characteristics to adjust for observed targeting: company size, tenure, usage before outreach, prior support tickets, account health, region, and segment.

**Key diagnostic.** Treated accounts were meaningfully higher risk before outreach. This explains why the naive comparison points in the wrong direction.

**Main estimate.** The doubly robust estimate suggests outreach changed churn risk by {best_row['estimate']:.1%} points, with an approximate 95% interval from {best_row['ci_low']:.1%} to {best_row['ci_high']:.1%} points.

**Interpretation.** Under the assumption that observed pre-treatment variables capture the important targeting logic, proactive outreach appears to reduce churn. The estimate should be interpreted as evidence from an observational design, not as strong as a randomized experiment.

**Main limitation.** If customer-success managers selected accounts using unrecorded information, residual confounding may remain.

**Recommendation.** Continue the program for high-risk eligible accounts, but run a randomized holdout or phased rollout in the next cycle to validate impact with stronger design evidence.
'''

display(Markdown(memo))

Draft Decision Memo

Question. Did proactive customer-success outreach reduce 90-day churn among eligible accounts?

Estimand. Average treatment effect of outreach on churn risk for eligible accounts in the analysis window.

Design. Historical outreach was not randomized. Outreach was targeted toward riskier accounts, so the naive comparison is not credible. We used pre-treatment account characteristics to adjust for observed targeting: company size, tenure, usage before outreach, prior support tickets, account health, region, and segment.

Key diagnostic. Treated accounts were meaningfully higher risk before outreach. This explains why the naive comparison points in the wrong direction.

Main estimate. The doubly robust estimate suggests outreach changed churn risk by -4.7% points, with an approximate 95% interval from -6.8% to -2.6% points.

Interpretation. Under the assumption that observed pre-treatment variables capture the important targeting logic, proactive outreach appears to reduce churn. The estimate should be interpreted as evidence from an observational design, not as strong as a randomized experiment.

Main limitation. If customer-success managers selected accounts using unrecorded information, residual confounding may remain.

Recommendation. Continue the program for high-risk eligible accounts, but run a randomized holdout or phased rollout in the next cycle to validate impact with stronger design evidence.

14. Governance: What to Save From an AI-Assisted Analysis

A serious AI-assisted causal workflow should leave an audit trail. Save the artifacts, not just the final estimate.

Recommended artifacts:

original stakeholder brief,
prompts used to generate estimand cards or DAG drafts,
structured AI outputs,
analyst edits to those outputs,
final estimand card,
final DAG or variable-role table,
data audit results,
excluded variables and reasons,
estimation code,
diagnostics and robustness checks,
final memo with limitations.

This discipline protects the analyst. It also makes AI assistance more credible because stakeholders can see the chain from question to design to evidence.

15. Key Takeaways

AI should help create reviewable causal artifacts, not directly manufacture causal conclusions.
The right first output is often an estimand card, not an estimate.
DAGs and variable-role tables are especially useful because they expose bad-control risks.
Data audits should be executable, not only prose checklists.
Naive comparisons can be badly wrong when treatment is targeted.
AI-generated code should be constrained by explicit causal guardrails.
The final memo should separate evidence, assumptions, limitations, and recommendations.

The next notebook goes one layer deeper into LLM basics for causal analysts: prompts, structured outputs, reproducibility, temperature, hallucination risk, and practical workflow patterns.