causal-learn Tutorial 02: Synthetic Data For Causal Discovery
This notebook builds the reusable synthetic datasets used by later causal-learn tutorials. Causal discovery is easiest to learn when we know the true graph. With a known graph, we can separate three questions that are often mixed together in real data:
Did the algorithm recover the right adjacencies?
Did it orient the recoverable arrows correctly?
Did the data-generating assumptions match the algorithm’s assumptions?
The goal here is not to create a giant simulation benchmark. The goal is to create a small, inspectable dataset library that teaches the main discovery regimes: linear Gaussian data, non-Gaussian noise, nonlinear mechanisms, discrete variables, hidden confounding, and nonstationary environments.
Each dataset will be saved with:
the observed data table;
a variable dictionary;
the true graph edge table;
simple diagnostics that confirm the generated data look like the intended scenario.
Later notebooks can load these files instead of redefining structural equations from scratch.
Notebook Flow
We will build the data factory in a deliberate order:
Set up imports, output folders, random seeds, and display options.
Define the base teaching DAG shared across most datasets.
Render the true DAG in the same style as the other tutorial figures.
Define structural-equation generators for several data regimes.
Save datasets, edge tables, metadata, and diagnostics.
Inspect summaries that make the scenario differences visible.
Close with a manifest showing exactly what downstream notebooks can load.
This notebook does not run a discovery algorithm yet. It prepares the ground truth that discovery notebooks will try to recover.
Setup
The setup cell imports the scientific Python stack, prepares output folders, fixes the random seed, and records package versions. The dedicated outputs/datasets folder keeps generated CSV files separate from figures and tables. All paths are relative to this tutorial folder so the notebooks remain portable inside the repository.
```python
from pathlib import Path
from importlib.metadata import PackageNotFoundError, version
import os
import warnings

# Keep local caches inside the repository workspace during notebook execution.
os.environ.setdefault("MPLCONFIGDIR", str(Path.cwd() / ".matplotlib_cache"))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from causallearn.graph.GraphNode import GraphNode
from causallearn.graph.Dag import Dag
from causallearn.graph.Edge import Edge
from causallearn.graph.Endpoint import Endpoint

warnings.filterwarnings("ignore", category=FutureWarning)
sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 140)

NOTEBOOK_DIR = Path.cwd()
if NOTEBOOK_DIR.name != "causal_learn":
    NOTEBOOK_DIR = Path("notebooks/tutorials/causal_learn").resolve()
else:
    NOTEBOOK_DIR = NOTEBOOK_DIR.resolve()

OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
DATASET_DIR = OUTPUT_DIR / "datasets"
REPORT_DIR = OUTPUT_DIR / "reports"
for directory in [OUTPUT_DIR, FIGURE_DIR, TABLE_DIR, DATASET_DIR, REPORT_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

NOTEBOOK_PREFIX = "02"
RANDOM_SEED = 42
N_ROWS = 2_500
rng = np.random.default_rng(RANDOM_SEED)


def pkg_version(package_name: str) -> str:
    """Return a package version string without failing if metadata is unavailable."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "not installed"


version_table = pd.DataFrame(
    [
        {"package": "causal-learn", "version": pkg_version("causal-learn")},
        {"package": "numpy", "version": pkg_version("numpy")},
        {"package": "pandas", "version": pkg_version("pandas")},
        {"package": "matplotlib", "version": pkg_version("matplotlib")},
        {"package": "seaborn", "version": pkg_version("seaborn")},
    ]
)
version_table
```
| | package | version |
|---|---|---|
| 0 | causal-learn | 0.1.4.5 |
| 1 | numpy | 2.4.4 |
| 2 | pandas | 3.0.2 |
| 3 | matplotlib | 3.10.9 |
| 4 | seaborn | 0.13.2 |
The version table and random seed make this notebook repeatable. If a later graph-recovery result changes, we can check whether the data changed, the package version changed, or the discovery settings changed.
Dataset Design Principles
Synthetic discovery data should be simple enough to audit but varied enough to stress different assumptions. This registry describes the scenarios we will generate. Each scenario has a clear purpose: some are friendly to PC-style methods, some are better suited to functional methods, and some intentionally violate assumptions so later notebooks can show failure modes.
```python
dataset_registry = pd.DataFrame(
    [
        {
            "dataset_name": "linear_gaussian",
            "row_count": N_ROWS,
            "variable_type": "continuous",
            "main_stress_test": "Linear additive mechanisms with Gaussian noise.",
            "use_later_for": "PC, Fisher-Z tests, GES, baseline graph recovery.",
            "known_limitation": "Purely observational data may not orient every Markov-equivalent edge.",
        },
        {
            "dataset_name": "linear_nongaussian",
            "row_count": N_ROWS,
            "variable_type": "continuous",
            "main_stress_test": "Linear mechanisms with non-Gaussian noise.",
            "use_later_for": "LiNGAM-style direction learning and non-Gaussian diagnostics.",
            "known_limitation": "Non-Gaussianity helps only when the linear model is a reasonable approximation.",
        },
        {
            "dataset_name": "nonlinear_continuous",
            "row_count": N_ROWS,
            "variable_type": "continuous",
            "main_stress_test": "Nonlinear parent effects and interactions.",
            "use_later_for": "Kernel tests, nonlinear functional methods, robustness checks.",
            "known_limitation": "Linear tests may miss or distort nonlinear dependence.",
        },
        {
            "dataset_name": "discrete_mixed",
            "row_count": N_ROWS,
            "variable_type": "binary and ordinal",
            "main_stress_test": "Discrete outcomes generated from latent logits.",
            "use_later_for": "Discrete conditional-independence tests and mixed-data cautions.",
            "known_limitation": "Treating these variables as Gaussian continuous data is a modeling mismatch.",
        },
        {
            "dataset_name": "hidden_confounder_observed",
            "row_count": N_ROWS,
            "variable_type": "continuous with one omitted cause",
            "main_stress_test": "A latent variable affects multiple observed variables.",
            "use_later_for": "FCI/PAG examples and hidden-confounding sensitivity.",
            "known_limitation": "The observed variables alone do not satisfy causal sufficiency.",
        },
        {
            "dataset_name": "nonstationary_continuous",
            "row_count": N_ROWS,
            "variable_type": "continuous plus environment label",
            "main_stress_test": "Mechanisms and distributions shift across environments.",
            "use_later_for": "CD-NOD and stability diagnostics across environments.",
            "known_limitation": "Pooling environments can hide mechanism changes.",
        },
    ]
)
dataset_registry.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_dataset_registry.csv", index=False)
dataset_registry
```
| | dataset_name | row_count | variable_type | main_stress_test | use_later_for | known_limitation |
|---|---|---|---|---|---|---|
| 0 | linear_gaussian | 2500 | continuous | Linear additive mechanisms with Gaussian noise. | PC, Fisher-Z tests, GES, baseline graph recovery. | Purely observational data may not orient every Markov-equivalent edge. |
| 1 | linear_nongaussian | 2500 | continuous | Linear mechanisms with non-Gaussian noise. | LiNGAM-style direction learning and non-Gaussian diagnostics. | Non-Gaussianity helps only when the linear model is a reasonable approximation. |
| 2 | nonlinear_continuous | 2500 | continuous | Nonlinear parent effects and interactions. | Kernel tests, nonlinear functional methods, robustness checks. | Linear tests may miss or distort nonlinear dependence. |
| 3 | discrete_mixed | 2500 | binary and ordinal | Discrete outcomes generated from latent logits. | Discrete conditional-independence tests and mixed-data cautions. | Treating these variables as Gaussian continuous data is a modeling mismatch. |
| 4 | hidden_confounder_observed | 2500 | continuous with one omitted cause | A latent variable affects multiple observed variables. | FCI/PAG examples and hidden-confounding sensitivity. | The observed variables alone do not satisfy causal sufficiency. |
| 5 | nonstationary_continuous | 2500 | continuous plus environment label | Mechanisms and distributions shift across environments. | CD-NOD and stability diagnostics across environments. | Pooling environments can hide mechanism changes. |
The registry is the notebook’s contract with later tutorials. Each generated file should be used for the algorithm family it was designed to teach, not as a universal benchmark. This prevents an easy mistake: judging an algorithm harshly on a dataset that violates its core assumptions without saying so.
Base Teaching DAG
Most datasets share the same observed causal graph. The variable names match the introductory causal-learn notebook so the tutorial series feels continuous:
need and intent are exogenous drivers.
match depends on need and intent.
engagement depends on match.
renewal depends on intent and engagement.
support depends on engagement.
This graph is a teaching DAG, not a claim about any real product system. Its purpose is to create known parent-child relationships that are easy to inspect.
```python
base_nodes = ["need", "intent", "match", "engagement", "renewal", "support"]
base_edge_table = pd.DataFrame(
    [
        {"source": "need", "target": "match", "edge_type": "directed",
         "mechanism": "Need changes what a good match means."},
        {"source": "intent", "target": "match", "edge_type": "directed",
         "mechanism": "Current intent changes recommendation relevance."},
        {"source": "match", "target": "engagement", "edge_type": "directed",
         "mechanism": "Better matching increases engagement depth."},
        {"source": "intent", "target": "renewal", "edge_type": "directed",
         "mechanism": "Intent directly affects later value."},
        {"source": "engagement", "target": "renewal", "edge_type": "directed",
         "mechanism": "Engagement contributes to renewal value."},
        {"source": "engagement", "target": "support", "edge_type": "directed",
         "mechanism": "Engagement creates more chances for support contact."},
    ]
)
base_edge_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_base_true_dag_edges.csv", index=False)
base_edge_table
```
| | source | target | edge_type | mechanism |
|---|---|---|---|---|
| 0 | need | match | directed | Need changes what a good match means. |
| 1 | intent | match | directed | Current intent changes recommendation relevance. |
| 2 | match | engagement | directed | Better matching increases engagement depth. |
| 3 | intent | renewal | directed | Intent directly affects later value. |
| 4 | engagement | renewal | directed | Engagement contributes to renewal value. |
| 5 | engagement | support | directed | Engagement creates more chances for support contact. |
The edge table is the ground truth for the base scenarios. Later notebooks can compare a learned graph against this table to compute adjacency and arrow metrics. Keeping the graph in table form also makes it easy to save, reload, and inspect without relying only on a picture.
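To make that comparison concrete, here is a minimal sketch (not a cell from this notebook) of how a later tutorial could score a learned edge list against the ground-truth table; the `learned_edges` list below is a hypothetical discovery result used only for illustration:

```python
def adjacency_scores(true_edges, learned_edges):
    """Compare skeletons: each input is a collection of (source, target) pairs.
    Edges are compared undirected, so orientation mistakes do not count here."""
    true_adj = {frozenset(edge) for edge in true_edges}
    learned_adj = {frozenset(edge) for edge in learned_edges}
    true_positives = len(true_adj & learned_adj)
    return {
        "adjacency_precision": true_positives / len(learned_adj) if learned_adj else 0.0,
        "adjacency_recall": true_positives / len(true_adj) if true_adj else 0.0,
    }

# True edges of the base teaching DAG.
true_edges = [("need", "match"), ("intent", "match"), ("match", "engagement"),
              ("intent", "renewal"), ("engagement", "renewal"), ("engagement", "support")]
# Hypothetical learned result: misses intent -> renewal, adds a spurious need -> renewal.
learned_edges = [("need", "match"), ("intent", "match"), ("match", "engagement"),
                 ("engagement", "renewal"), ("engagement", "support"), ("need", "renewal")]
print(adjacency_scores(true_edges, learned_edges))  # → precision 5/6, recall 5/6
```

A companion score for arrow orientation would compare directed pairs instead of frozensets; the skeleton score above is the usual first diagnostic.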
Render The Base DAG
The figure uses the same visual style as the other tutorial DAGs: a wide canvas, rounded pastel boxes, bold labels, and dark annotation arrows with clear arrowhead spacing. Visual consistency matters because these notebooks are meant to be read as a tutorial sequence.
The picture shows the two-parent collider at match, the downstream role of engagement, and the direct intent-to-renewal edge that bypasses the longer mediated path through match and engagement. Later algorithm notebooks can ask whether the learned graph finds those relationships or leaves some directions unresolved.
causal-learn Graph Object For The Ground Truth
The saved edge table is the most portable representation, but later graph metrics often expect causal-learn graph objects. This cell builds a Dag object from the base edge table and saves a true adjacency matrix. The matrix is useful for programmatic checks; the edge table remains better for humans.
```python
def build_causallearn_dag(node_names, edge_table):
    """Convert a directed edge table into a causal-learn Dag object."""
    graph_nodes = [GraphNode(name) for name in node_names]
    node_map = {node.get_name(): node for node in graph_nodes}
    dag = Dag(graph_nodes)
    for row in edge_table.itertuples(index=False):
        dag.add_edge(Edge(node_map[row.source], node_map[row.target], Endpoint.TAIL, Endpoint.ARROW))
    return dag


base_dag = build_causallearn_dag(base_nodes, base_edge_table)

base_adjacency = pd.DataFrame(0, index=base_nodes, columns=base_nodes, dtype=int)
for row in base_edge_table.itertuples(index=False):
    base_adjacency.loc[row.source, row.target] = 1
base_adjacency.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_base_true_adjacency_matrix.csv")

causal_learn_edge_strings = pd.DataFrame(
    {"causal_learn_edge": [str(edge) for edge in base_dag.get_graph_edges()]}
)
causal_learn_edge_strings.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_base_causallearn_edge_strings.csv", index=False)
causal_learn_edge_strings
```
| | causal_learn_edge |
|---|---|
| 0 | need --> match |
| 1 | intent --> match |
| 2 | intent --> renewal |
| 3 | match --> engagement |
| 4 | engagement --> renewal |
| 5 | engagement --> support |
The causal-learn edge strings confirm that every base edge is directed. The adjacency matrix is intentionally simple: a 1 means the row variable is a direct parent of the column variable in the true teaching graph.
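One cheap programmatic check on the saved matrix is acyclicity: the adjacency matrix of a DAG is nilpotent, so raising it to the number of nodes must yield the zero matrix. A self-contained sketch (not a notebook cell) using the base edges:

```python
import numpy as np

nodes = ["need", "intent", "match", "engagement", "renewal", "support"]
index = {name: i for i, name in enumerate(nodes)}

adjacency = np.zeros((len(nodes), len(nodes)), dtype=int)
for source, target in [("need", "match"), ("intent", "match"), ("match", "engagement"),
                       ("intent", "renewal"), ("engagement", "renewal"), ("engagement", "support")]:
    adjacency[index[source], index[target]] = 1

# A directed graph on n nodes is acyclic iff A^n is the zero matrix
# (any entry of A^n counts directed walks of length n, which a DAG cannot have).
power = np.linalg.matrix_power(adjacency, len(nodes))
print("acyclic:", not power.any())  # → acyclic: True
```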
Variable Dictionary
Before generating values, we define what each observed field means. A variable dictionary is especially important in causal discovery because variable names alone do not tell us which columns are roots, intermediate variables, outcomes, or context variables.
```python
variable_dictionary = pd.DataFrame(
    [
        {
            "variable": "need",
            "role": "root cause",
            "type_in_base_datasets": "continuous",
            "true_parents": "none",
            "plain_language": "Baseline user need or demand intensity.",
        },
        {
            "variable": "intent",
            "role": "root cause",
            "type_in_base_datasets": "continuous",
            "true_parents": "none",
            "plain_language": "Current session intent or motivation signal.",
        },
        {
            "variable": "match",
            "role": "mediating variable",
            "type_in_base_datasets": "continuous",
            "true_parents": "need, intent",
            "plain_language": "How well the surfaced option matches the user's situation.",
        },
        {
            "variable": "engagement",
            "role": "mediating variable",
            "type_in_base_datasets": "continuous",
            "true_parents": "match",
            "plain_language": "Depth of downstream engagement after the match quality is realized.",
        },
        {
            "variable": "renewal",
            "role": "outcome",
            "type_in_base_datasets": "continuous",
            "true_parents": "intent, engagement",
            "plain_language": "Future value or renewal-like outcome proxy.",
        },
        {
            "variable": "support",
            "role": "outcome",
            "type_in_base_datasets": "continuous",
            "true_parents": "engagement",
            "plain_language": "Support-contact load or friction outcome proxy.",
        },
        {
            "variable": "environment",
            "role": "context variable",
            "type_in_base_datasets": "integer label",
            "true_parents": "none",
            "plain_language": "Observed environment or regime label used only in the nonstationary dataset.",
        },
        {
            "variable": "latent_demand",
            "role": "unobserved common cause",
            "type_in_base_datasets": "continuous",
            "true_parents": "none",
            "plain_language": "Hidden variable saved only in the full hidden-confounder file for teaching diagnostics.",
        },
    ]
)
variable_dictionary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_variable_dictionary.csv", index=False)
variable_dictionary
```
| | variable | role | type_in_base_datasets | true_parents | plain_language |
|---|---|---|---|---|---|
| 0 | need | root cause | continuous | none | Baseline user need or demand intensity. |
| 1 | intent | root cause | continuous | none | Current session intent or motivation signal. |
| 2 | match | mediating variable | continuous | need, intent | How well the surfaced option matches the user's situation. |
| 3 | engagement | mediating variable | continuous | match | Depth of downstream engagement after the match quality is realized. |
| 4 | renewal | outcome | continuous | intent, engagement | Future value or renewal-like outcome proxy. |
| 5 | support | outcome | continuous | engagement | Support-contact load or friction outcome proxy. |
| 6 | environment | context variable | integer label | none | Observed environment or regime label used only in the nonstationary dataset. |
| 7 | latent_demand | unobserved common cause | continuous | none | Hidden variable saved only in the full hidden-confounder file for teaching diagnostics. |
The dictionary makes the downstream modeling assumptions explicit. For example, environment is not a normal causal variable in the base DAG; it is a context label used when we intentionally introduce distribution shift.
Shared Simulation Helpers
The next cell defines small helper functions used by all scenarios. The most important helper is standardize, which keeps variables on comparable scales. This makes later graph-discovery behavior easier to inspect because one column will not dominate simply because it has a much larger numeric range.
```python
def standardize(values):
    """Return a centered, unit-variance version of a numeric array."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=0)


def sigmoid(values):
    """Stable logistic transform for generating probabilities."""
    values = np.asarray(values, dtype=float)
    return 1 / (1 + np.exp(-np.clip(values, -30, 30)))


def draw_noise(rng, n_rows, distribution="normal", scale=1.0):
    """Draw noise with a named distribution and standardize it before scaling."""
    if distribution == "normal":
        values = rng.normal(0, 1, n_rows)
    elif distribution == "laplace":
        values = rng.laplace(0, 1, n_rows)
    elif distribution == "student_t":
        values = rng.standard_t(df=4, size=n_rows)
    elif distribution == "uniform":
        values = rng.uniform(-1, 1, n_rows)
    else:
        raise ValueError(f"Unknown noise distribution: {distribution}")
    return scale * standardize(values)


def save_dataset(name, data, edge_table, notes):
    """Save one synthetic dataset and its matching edge table."""
    data_path = DATASET_DIR / f"{NOTEBOOK_PREFIX}_{name}.csv"
    edge_path = TABLE_DIR / f"{NOTEBOOK_PREFIX}_{name}_true_edges.csv"
    note_path = TABLE_DIR / f"{NOTEBOOK_PREFIX}_{name}_notes.csv"
    data.to_csv(data_path, index=False)
    edge_table.to_csv(edge_path, index=False)
    pd.DataFrame(notes).to_csv(note_path, index=False)
    return {
        "dataset_name": name,
        "data_path": str(data_path.relative_to(NOTEBOOK_DIR)),
        "edge_path": str(edge_path.relative_to(NOTEBOOK_DIR)),
        "notes_path": str(note_path.relative_to(NOTEBOOK_DIR)),
        "rows": len(data),
        "columns": data.shape[1],
    }


"helpers ready"
```
'helpers ready'
These helpers keep the structural-equation cells focused on causal mechanisms rather than file handling. The saved note file for each dataset records why the scenario exists and what assumption it is meant to test.
Linear Gaussian Dataset
The first dataset is the friendly baseline for constraint-based discovery with Fisher-Z tests. Every mechanism is linear and additive, the noise is Gaussian, and all common causes in the base graph are observed. This is the cleanest setting for early PC and GES examples.
```python
def simulate_linear_gaussian(n_rows, seed):
    """Generate data from the base DAG with linear Gaussian structural equations."""
    local_rng = np.random.default_rng(seed)
    need = draw_noise(local_rng, n_rows, "normal", scale=1.0)
    intent = draw_noise(local_rng, n_rows, "normal", scale=1.0)
    match = standardize(0.80 * need + 0.90 * intent + draw_noise(local_rng, n_rows, "normal", scale=0.70))
    engagement = standardize(1.10 * match + draw_noise(local_rng, n_rows, "normal", scale=0.75))
    renewal = standardize(0.70 * intent + 0.55 * engagement + draw_noise(local_rng, n_rows, "normal", scale=0.80))
    support = standardize(0.65 * engagement + draw_noise(local_rng, n_rows, "normal", scale=0.90))
    return pd.DataFrame(
        {
            "need": need,
            "intent": intent,
            "match": match,
            "engagement": engagement,
            "renewal": renewal,
            "support": support,
        }
    )


linear_gaussian = simulate_linear_gaussian(N_ROWS, RANDOM_SEED + 1)
linear_gaussian.head()
```
| | need | intent | match | engagement | renewal | support |
|---|---|---|---|---|---|---|
| 0 | 0.249820 | -0.372094 | 0.060245 | 0.667197 | 0.252766 | -0.280058 |
| 1 | 0.683671 | -0.210471 | 0.904969 | 1.004727 | 0.320095 | -0.332215 |
| 2 | -0.579752 | -1.202671 | -0.578579 | -0.235444 | -0.732431 | 0.594102 |
| 3 | -0.902823 | -0.077309 | -0.771219 | -0.531128 | -0.105721 | -1.503551 |
| 4 | -1.985745 | 0.087297 | -0.691315 | -1.281731 | -0.797906 | -0.328219 |
The first rows are centered continuous variables. The values are not meant to have real-world units. They are standardized signals designed to make the graph-recovery problem easy to inspect.
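A quick way to see why this baseline is friendly to constraint-based methods: in the true graph, need and intent are marginally independent but become dependent once we condition on their common child match. The sketch below is an illustration, not a notebook cell; it re-simulates a small collider with assumed coefficients and checks both correlations:

```python
import numpy as np

rng = np.random.default_rng(43)
n = 2_500
need = rng.normal(0, 1, n)
intent = rng.normal(0, 1, n)
match = 0.8 * need + 0.9 * intent + 0.7 * rng.normal(0, 1, n)  # collider


def partial_corr(x, y, conditioning):
    """Correlation of x and y after linearly regressing both on the conditioning variable."""
    design = np.column_stack([np.ones_like(conditioning), conditioning])
    residual_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    residual_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(residual_x, residual_y)[0, 1]


print("marginal corr(need, intent):", round(np.corrcoef(need, intent)[0, 1], 3))  # near 0
print("corr(need, intent | match): ", round(partial_corr(need, intent, match), 3))  # clearly negative
```

This is exactly the v-structure signal that lets PC orient the two edges into match even from purely observational data.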
Linear Non-Gaussian Dataset
This dataset keeps the same linear graph but changes the noise distribution. Non-Gaussian noise is important because some functional causal discovery methods can identify directions that remain ambiguous under purely Gaussian observational assumptions.
```python
def simulate_linear_nongaussian(n_rows, seed):
    """Generate data from the base DAG with linear mechanisms and non-Gaussian noise."""
    local_rng = np.random.default_rng(seed)
    need = draw_noise(local_rng, n_rows, "laplace", scale=1.0)
    intent = draw_noise(local_rng, n_rows, "student_t", scale=1.0)
    match = standardize(0.80 * need + 0.90 * intent + draw_noise(local_rng, n_rows, "laplace", scale=0.70))
    engagement = standardize(1.10 * match + draw_noise(local_rng, n_rows, "student_t", scale=0.75))
    renewal = standardize(0.70 * intent + 0.55 * engagement + draw_noise(local_rng, n_rows, "laplace", scale=0.80))
    support = standardize(0.65 * engagement + draw_noise(local_rng, n_rows, "student_t", scale=0.90))
    return pd.DataFrame(
        {
            "need": need,
            "intent": intent,
            "match": match,
            "engagement": engagement,
            "renewal": renewal,
            "support": support,
        }
    )


linear_nongaussian = simulate_linear_nongaussian(N_ROWS, RANDOM_SEED + 2)
linear_nongaussian.head()
```
| | need | intent | match | engagement | renewal | support |
|---|---|---|---|---|---|---|
| 0 | -0.980023 | 1.665654 | 0.496861 | 0.751623 | 1.138663 | 0.568429 |
| 1 | -0.459670 | -0.094741 | -0.025272 | -1.332604 | -0.338999 | -1.008301 |
| 2 | -0.143588 | -0.640327 | -0.119029 | 0.149667 | -0.310379 | -0.448921 |
| 3 | 1.949271 | -0.678107 | 0.036265 | -0.200098 | -0.704465 | -1.239684 |
| 4 | -0.783755 | 0.859084 | 0.202363 | 0.302841 | -1.131283 | 0.398547 |
The graph is unchanged, but the marginal distributions are less Gaussian. This lets later notebooks show how a discovery method can depend on both the graph structure and the noise assumptions.
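A simple diagnostic for that difference is excess kurtosis, which is near zero for Gaussian draws, exactly 3 for Laplace, and large for Student-t with few degrees of freedom (the population value diverges at df = 4). A standalone sketch, not part of the notebook's own diagnostics:

```python
import numpy as np

rng = np.random.default_rng(44)
n = 100_000


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = (x - x.mean()) / x.std()
    return (x ** 4).mean() - 3.0


k_normal = excess_kurtosis(rng.normal(0, 1, n))
k_laplace = excess_kurtosis(rng.laplace(0, 1, n))
k_student = excess_kurtosis(rng.standard_t(4, size=n))

print("normal :", round(k_normal, 2))   # near 0
print("laplace:", round(k_laplace, 2))  # near 3
print("t(df=4):", round(k_student, 2))  # heavy-tailed, well above 3
```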
Nonlinear Continuous Dataset
The third dataset keeps the same parent sets but changes the functional form. Parent effects include nonlinear transforms and interactions. This scenario is useful for showing why linear partial-correlation tests can miss dependence that a nonlinear test might detect.
The columns still look like ordinary continuous variables, but the parent-child relationships are no longer purely linear. This is the kind of situation where a correlation matrix can understate the true causal dependence.
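For readers who want to see what such a generator can look like, here is an illustrative sketch that keeps the base parent sets; the specific transforms (`tanh`, a squared term, a parent interaction) and every coefficient are assumptions for demonstration, not the notebook's actual definitions:

```python
import numpy as np
import pandas as pd


def zscore(values):
    """Center and scale to unit variance."""
    return (values - values.mean()) / values.std()


def simulate_nonlinear_demo(n_rows, seed):
    """Illustrative nonlinear SEM with the base DAG's parent sets (assumed forms)."""
    rng = np.random.default_rng(seed)
    need = zscore(rng.normal(0, 1, n_rows))
    intent = zscore(rng.normal(0, 1, n_rows))
    # Nonlinear parent effects plus a need-by-intent interaction.
    match = zscore(np.tanh(need) + 0.8 * intent + 0.5 * need * intent + 0.6 * rng.normal(0, 1, n_rows))
    engagement = zscore(match ** 2 + 0.9 * match + 0.7 * rng.normal(0, 1, n_rows))
    renewal = zscore(0.6 * np.sin(intent) + 0.5 * engagement + 0.8 * rng.normal(0, 1, n_rows))
    support = zscore(np.abs(engagement) + 0.9 * rng.normal(0, 1, n_rows))
    return pd.DataFrame({"need": need, "intent": intent, "match": match,
                         "engagement": engagement, "renewal": renewal, "support": support})


nonlinear_demo = simulate_nonlinear_demo(1_000, 7)
```

The squared and absolute-value terms are what break linear partial-correlation tests: the dependence survives, but its linear projection can be close to zero.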
Discrete And Mixed Dataset
Many real discovery problems involve binary or ordinal variables. This dataset creates discrete variables from latent logits while preserving the same broad causal ordering. It is intentionally not suitable for a vanilla Gaussian workflow unless we first justify that approximation.
The discrete table has binary roots and outcomes, with engagement as a three-level ordinal variable. Later notebooks can use it to discuss why the choice of conditional-independence test must match the data type.
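An illustrative version of the latent-logit construction is sketched below; the thresholds, cutpoints, and coefficients are assumptions for demonstration, not the notebook's actual definitions:

```python
import numpy as np
import pandas as pd


def simulate_discrete_demo(n_rows, seed):
    """Illustrative discrete SEM: binary variables from logits, ordinal engagement
    from a latent score with two cutpoints (all parameters assumed)."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-np.clip(v, -30, 30)))
    need = (rng.normal(0, 1, n_rows) > 0).astype(int)    # binary root
    intent = (rng.normal(0, 1, n_rows) > 0).astype(int)  # binary root
    match = rng.binomial(1, sigmoid(0.9 * need + 1.0 * intent - 0.9))
    # Three-level ordinal engagement: latent score cut at two thresholds.
    latent_score = 1.1 * match + rng.logistic(0, 1, n_rows)
    engagement = np.digitize(latent_score, bins=[-0.3, 1.2])  # values 0, 1, 2
    renewal = rng.binomial(1, sigmoid(0.8 * intent + 0.6 * engagement - 1.0))
    support = rng.binomial(1, sigmoid(0.7 * engagement - 1.2))
    return pd.DataFrame({"need": need, "intent": intent, "match": match,
                         "engagement": engagement, "renewal": renewal, "support": support})


discrete_demo = simulate_discrete_demo(1_000, 11)
```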
Hidden-Confounder Dataset
The hidden-confounder scenario adds an unobserved latent_demand variable. It affects both match and renewal, which means the observed variables alone violate causal sufficiency. We save two versions:
a full diagnostic file that includes latent_demand;
an observed file that omits latent_demand, matching what a discovery algorithm would see in a hidden-confounding example.
This prepares the ground for FCI and PAG tutorials.
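The edge table for the full diagnostic file extends the base edges with the two latent edges. A sketch of that construction (the real notebook table also carries a mechanism column for every row; this minimal version keeps only the structural columns):

```python
import pandas as pd

# Base observed edges of the teaching DAG.
base_edge_table = pd.DataFrame(
    [
        {"source": "need", "target": "match", "edge_type": "directed"},
        {"source": "intent", "target": "match", "edge_type": "directed"},
        {"source": "match", "target": "engagement", "edge_type": "directed"},
        {"source": "intent", "target": "renewal", "edge_type": "directed"},
        {"source": "engagement", "target": "renewal", "edge_type": "directed"},
        {"source": "engagement", "target": "support", "edge_type": "directed"},
    ]
)

# The two extra edges from the unobserved common cause.
latent_edges = pd.DataFrame(
    [
        {"source": "latent_demand", "target": "match", "edge_type": "latent"},
        {"source": "latent_demand", "target": "renewal", "edge_type": "latent"},
    ]
)

hidden_edge_table = pd.concat([base_edge_table, latent_edges], ignore_index=True)
```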
Tail of the hidden-confounder edge table:

| | source | target | edge_type | mechanism |
|---|---|---|---|---|
| 5 | engagement | support | directed | Engagement creates more chances for support contact. |
| 6 | latent_demand | match | latent | Unobserved demand makes better matches more likely. |
| 7 | latent_demand | renewal | latent | The same unobserved demand also affects renewal. |
The dashed latent edges show why this scenario is different from the base DAG. A learner that only sees observed variables cannot condition on latent_demand, so some observed relationships may look like direct or ambiguous causal connections.
Generate The Hidden-Confounder Values
This cell generates the hidden-confounder data. The full table includes the latent variable so we can verify the data-generating process. The observed table removes it, because that is the realistic discovery input for a hidden-confounding tutorial.
```python
def simulate_hidden_confounder(n_rows, seed):
    """Generate data where one common cause is hidden from the observed dataset."""
    local_rng = np.random.default_rng(seed)
    need = draw_noise(local_rng, n_rows, "normal", scale=1.0)
    intent = draw_noise(local_rng, n_rows, "normal", scale=1.0)
    latent_demand = draw_noise(local_rng, n_rows, "normal", scale=1.0)
    match = standardize(
        0.75 * need
        + 0.85 * intent
        + 0.80 * latent_demand
        + draw_noise(local_rng, n_rows, "normal", scale=0.65)
    )
    engagement = standardize(1.00 * match + draw_noise(local_rng, n_rows, "normal", scale=0.75))
    renewal = standardize(
        0.65 * intent
        + 0.55 * engagement
        + 0.70 * latent_demand
        + draw_noise(local_rng, n_rows, "normal", scale=0.80)
    )
    support = standardize(0.60 * engagement + draw_noise(local_rng, n_rows, "normal", scale=0.90))
    full = pd.DataFrame(
        {
            "need": need,
            "intent": intent,
            "latent_demand": latent_demand,
            "match": match,
            "engagement": engagement,
            "renewal": renewal,
            "support": support,
        }
    )
    observed = full.drop(columns="latent_demand")
    return observed, full


hidden_confounder_observed, hidden_confounder_full = simulate_hidden_confounder(N_ROWS, RANDOM_SEED + 5)
hidden_confounder_full.head()
```
| | need | intent | latent_demand | match | engagement | renewal | support |
|---|---|---|---|---|---|---|---|
| 0 | -0.670177 | 0.916751 | -1.978234 | -0.811025 | -0.543636 | -1.056283 | -0.760347 |
| 1 | 0.099398 | 0.355660 | 1.211701 | 0.738471 | 0.195636 | 1.231812 | -0.037870 |
| 2 | -1.912569 | 1.550731 | 1.095861 | 0.853338 | 0.529918 | 0.922433 | 0.886718 |
| 3 | -1.953293 | -0.835806 | 0.227012 | -0.691733 | -1.561579 | -0.399516 | -0.891322 |
| 4 | -1.419799 | 1.690953 | -1.899709 | -0.839137 | -0.319077 | 0.325408 | -0.029369 |
The full diagnostic file makes the omitted cause visible to us as notebook authors. Later, when we hide latent_demand, we can explain exactly why a fully observed DAG assumption is no longer valid.
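The consequence of hiding latent_demand can be checked numerically: after regressing match and renewal on their observed parents, the residuals stay correlated because the latent variable drives both. The sketch below is a standalone re-simulation with the same coefficients as the generator above, not a notebook cell:

```python
import numpy as np

rng = np.random.default_rng(47)
n = 5_000
need, intent, latent = rng.normal(0, 1, (3, n))
match = 0.75 * need + 0.85 * intent + 0.80 * latent + 0.65 * rng.normal(0, 1, n)
engagement = 1.00 * match + 0.75 * rng.normal(0, 1, n)
renewal = 0.65 * intent + 0.55 * engagement + 0.70 * latent + 0.80 * rng.normal(0, 1, n)


def residual(y, predictors):
    """Residual of y after ordinary least squares on the given predictor arrays."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    return y - design @ np.linalg.lstsq(design, y, rcond=None)[0]


r_match = residual(match, [need, intent])            # observed parents of match
r_renewal = residual(renewal, [intent, engagement])  # observed parents of renewal
print(round(np.corrcoef(r_match, r_renewal)[0, 1], 2))  # clearly positive: shared hidden cause
```

In a fully observed DAG these residuals would be uncorrelated; the leftover correlation is the signature FCI-style methods use to report the relationship as possibly confounded.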
Nonstationary Dataset
The nonstationary dataset adds an environment label. Each environment changes some root distributions and one mechanism strength. This gives later CD-NOD and stability notebooks a controlled example where pooling all rows hides important regime differences.
```python
nonstationary_labels = {
    **base_labels,
    "environment": "Environment",
}
nonstationary_positions = {
    **base_positions,
    "environment": (0.34, 0.88),
}
nonstationary_node_colors = {
    **base_node_colors,
    "environment": "#fef3c7",
}
nonstationary_edge_table = pd.concat(
    [
        base_edge_table,
        pd.DataFrame(
            [
                {"source": "environment", "target": "need", "edge_type": "directed",
                 "mechanism": "The root need distribution shifts by environment."},
                {"source": "environment", "target": "intent", "edge_type": "directed",
                 "mechanism": "The root intent distribution shifts by environment."},
                {"source": "environment", "target": "match", "edge_type": "directed",
                 "mechanism": "The intent-to-match mechanism changes by environment."},
            ]
        ),
    ],
    ignore_index=True,
)
nonstationary_edge_radii = {
    **base_edge_radii,
    ("environment", "need"): 0.05,
    ("environment", "intent"): 0.16,
    ("environment", "match"): -0.04,
}
nonstationary_dag_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_nonstationary_true_dag.png"
draw_teaching_style_dag(
    nonstationary_edge_table,
    nonstationary_labels,
    nonstationary_positions,
    nonstationary_node_colors,
    "Nonstationary Teaching DAG",
    nonstationary_dag_path,
    edge_radii=nonstationary_edge_radii,
)
nonstationary_edge_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_nonstationary_true_edges.csv", index=False)
nonstationary_edge_table.tail(5)
```
| | source | target | edge_type | mechanism |
|---|---|---|---|---|
| 4 | engagement | renewal | directed | Engagement contributes to renewal value. |
| 5 | engagement | support | directed | Engagement creates more chances for support contact. |
| 6 | environment | need | directed | The root need distribution shifts by environment. |
| 7 | environment | intent | directed | The root intent distribution shifts by environment. |
| 8 | environment | match | directed | The intent-to-match mechanism changes by environment. |
The environment node is drawn as an observed context. It is not an outcome we are trying to explain; it marks regimes where data distributions and mechanisms can change.
Generate The Nonstationary Values
This cell creates three environments with different root means and different intent -> match strengths. The graph among product variables remains recognizable, but the data distribution is no longer exchangeable across all rows.
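An illustrative generator in this spirit is sketched below; the environment means, the regime-specific intent-to-match strengths, and all coefficients are assumptions for demonstration rather than the notebook's actual values:

```python
import numpy as np
import pandas as pd


def simulate_nonstationary_demo(n_rows, seed):
    """Illustrative nonstationary SEM: three environments with shifted root means
    and an environment-dependent intent -> match strength (parameters assumed)."""
    rng = np.random.default_rng(seed)
    env = rng.integers(0, 3, n_rows)                      # environment label: 0, 1, 2
    need_shift = np.array([-0.5, 0.0, 0.6])[env]          # root mean shifts by regime
    intent_shift = np.array([0.4, 0.0, -0.4])[env]
    intent_to_match = np.array([0.4, 0.9, 1.4])[env]      # mechanism change by regime
    need = need_shift + rng.normal(0, 1, n_rows)
    intent = intent_shift + rng.normal(0, 1, n_rows)
    match = 0.8 * need + intent_to_match * intent + 0.7 * rng.normal(0, 1, n_rows)
    engagement = 1.1 * match + 0.75 * rng.normal(0, 1, n_rows)
    renewal = 0.7 * intent + 0.55 * engagement + 0.8 * rng.normal(0, 1, n_rows)
    support = 0.65 * engagement + 0.9 * rng.normal(0, 1, n_rows)
    return pd.DataFrame({"environment": env, "need": need, "intent": intent, "match": match,
                         "engagement": engagement, "renewal": renewal, "support": support})


nonstationary_demo = simulate_nonstationary_demo(3_000, 13)
```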
The first rows look ordinary, but the environment column tells us they come from different regimes. Later notebooks can compare pooled discovery against environment-aware diagnostics.
Save All Datasets
Now we save each dataset and its matching true edge table. The observed hidden-confounder dataset uses the base observed edges as the visible structural reference, while the full hidden edge table is saved separately for diagnostics.
```python
scenario_notes = {
    "linear_gaussian": [
        {"note_type": "assumption", "note": "Linear additive structural equations with Gaussian noise."},
        {"note_type": "intended_use", "note": "Friendly baseline for Fisher-Z PC and score-based search."},
    ],
    "linear_nongaussian": [
        {"note_type": "assumption", "note": "Linear additive structural equations with Laplace and Student-t noise."},
        {"note_type": "intended_use", "note": "Useful for LiNGAM-style non-Gaussian direction examples."},
    ],
    "nonlinear_continuous": [
        {"note_type": "assumption", "note": "Nonlinear mechanisms with interactions, but the same parent sets as the base DAG."},
        {"note_type": "intended_use", "note": "Useful for kernel tests and nonlinear method cautions."},
    ],
    "discrete_mixed": [
        {"note_type": "assumption", "note": "Binary and ordinal values generated from latent logits."},
        {"note_type": "intended_use", "note": "Useful for discrete-data tests and mixed-data caveats."},
    ],
    "hidden_confounder_observed": [
        {"note_type": "assumption", "note": "Observed file omits latent_demand, which affects match and renewal."},
        {"note_type": "intended_use", "note": "Useful for FCI/PAG hidden-confounding examples."},
    ],
    "hidden_confounder_full": [
        {"note_type": "assumption", "note": "Diagnostic file includes latent_demand so the hidden-confounding design can be verified."},
        {"note_type": "intended_use", "note": "Do not use as the observed discovery input unless teaching oracle access."},
    ],
    "nonstationary_continuous": [
        {"note_type": "assumption", "note": "Environment changes root distributions and one mechanism strength."},
        {"note_type": "intended_use", "note": "Useful for CD-NOD and environment-stability examples."},
    ],
}

saved_files = []
saved_files.append(save_dataset("linear_gaussian", linear_gaussian, base_edge_table, scenario_notes["linear_gaussian"]))
saved_files.append(save_dataset("linear_nongaussian", linear_nongaussian, base_edge_table, scenario_notes["linear_nongaussian"]))
saved_files.append(save_dataset("nonlinear_continuous", nonlinear_continuous, base_edge_table, scenario_notes["nonlinear_continuous"]))
saved_files.append(save_dataset("discrete_mixed", discrete_mixed, base_edge_table, scenario_notes["discrete_mixed"]))
saved_files.append(save_dataset("hidden_confounder_observed", hidden_confounder_observed, base_edge_table, scenario_notes["hidden_confounder_observed"]))
saved_files.append(save_dataset("hidden_confounder_full", hidden_confounder_full, hidden_edge_table, scenario_notes["hidden_confounder_full"]))
saved_files.append(save_dataset("nonstationary_continuous", nonstationary_continuous, nonstationary_edge_table, scenario_notes["nonstationary_continuous"]))

saved_file_table = pd.DataFrame(saved_files)
saved_file_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_saved_dataset_files.csv", index=False)
saved_file_table
```
The saved file table is what downstream notebooks should use. The distinction between hidden_confounder_observed and hidden_confounder_full is important: discovery should use the observed file, while teaching diagnostics can use the full file.
Shape And Missingness Checks
Generated data should still be audited. The next cell checks shape, column order, and missingness for every saved dataset. A synthetic generator that silently creates missing values or inconsistent columns would make later algorithm behavior harder to explain.
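A minimal version of such an audit can be sketched as follows; it assumes a `datasets` dictionary mapping each scenario name to its DataFrame, shown here with a stand-in entry so the sketch is self-contained:

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's mapping of scenario names to generated DataFrames.
datasets = {
    "demo": pd.DataFrame({"a": np.arange(5.0), "b": np.ones(5)}),
}

audit_rows = []
for name, data in datasets.items():
    audit_rows.append(
        {
            "dataset_name": name,
            "rows": len(data),
            "columns": data.shape[1],
            "column_names": ", ".join(data.columns),
            "missing_cells": int(data.isna().sum().sum()),
        }
    )
audit_table = pd.DataFrame(audit_rows)
print(audit_table)
```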
Audit output (truncated to the last rows in this export): hidden_confounder_full has columns need, intent, latent_demand, match, engagement, renewal, support and 0 missing cells; nonstationary_continuous has 2500 rows and 7 columns (environment, need, intent, match, engagement, renewal, support) with 0 missing cells.
All generated datasets should have the intended number of rows and no missing values. This keeps later discovery behavior focused on causal assumptions rather than data-cleaning artifacts.
Continuous Summary Statistics
This summary checks whether the continuous datasets are roughly centered and scaled. Standardization does not make all scenarios identical: non-Gaussian and nonlinear datasets can still differ in skew, kurtosis, and dependence patterns.
```python
continuous_dataset_names = [
    "linear_gaussian",
    "linear_nongaussian",
    "nonlinear_continuous",
    "hidden_confounder_observed",
    "nonstationary_continuous",
]

summary_rows = []
for name in continuous_dataset_names:
    data = datasets[name].drop(columns=["environment"], errors="ignore")
    for column in base_nodes:
        series = data[column]
        summary_rows.append(
            {
                "dataset_name": name,
                "variable": column,
                "mean": series.mean(),
                "std": series.std(ddof=0),
                "min": series.min(),
                "median": series.median(),
                "max": series.max(),
                "skew": series.skew(),
                "kurtosis": series.kurtosis(),
            }
        )

continuous_summary = pd.DataFrame(summary_rows)
continuous_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_continuous_summary_statistics.csv", index=False)
continuous_summary.head(12)
```
| | dataset_name | variable | mean | std | min | median | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | linear_gaussian | need | 7.105427e-18 | 1.0 | -3.371900 | 0.019662 | 3.316358 | -0.010639 | 0.003971 |
| 1 | linear_gaussian | intent | -1.136868e-17 | 1.0 | -3.467962 | -0.012950 | 3.389982 | 0.020848 | -0.077626 |
| 2 | linear_gaussian | match | -2.131628e-18 | 1.0 | -3.596272 | 0.001127 | 3.917631 | -0.027569 | 0.014416 |
| 3 | linear_gaussian | engagement | -6.394885e-18 | 1.0 | -3.404627 | 0.004449 | 3.469112 | -0.032966 | 0.018805 |
| 4 | linear_gaussian | renewal | -7.105427e-18 | 1.0 | -3.418065 | -0.024579 | 3.122549 | 0.111052 | -0.119672 |
| 5 | linear_gaussian | support | 6.394885e-18 | 1.0 | -3.250120 | -0.007028 | 3.174916 | -0.022463 | -0.203720 |
| 6 | linear_nongaussian | need | 1.421085e-17 | 1.0 | -6.499779 | 0.007874 | 6.786758 | -0.064023 | 3.522823 |
| 7 | linear_nongaussian | intent | 1.136868e-17 | 1.0 | -7.262542 | 0.016406 | 6.225672 | -0.244536 | 5.495513 |
| 8 | linear_nongaussian | match | 1.421085e-17 | 1.0 | -5.379496 | 0.008311 | 4.600044 | -0.154743 | 1.669282 |
| 9 | linear_nongaussian | engagement | 5.684342e-18 | 1.0 | -4.766194 | 0.006455 | 5.646124 | 0.020379 | 1.669102 |
| 10 | linear_nongaussian | renewal | 5.684342e-18 | 1.0 | -6.663709 | 0.021946 | 4.560687 | -0.165841 | 2.211792 |
| 11 | linear_nongaussian | support | 8.526513e-18 | 1.0 | -4.489456 | -0.004856 | 4.980148 | 0.184690 | 1.935507 |
The means are close to zero and standard deviations are close to one because of standardization. The skew and kurtosis columns are more revealing: they help distinguish Gaussian-style data from heavier-tailed or nonlinear scenarios.
Distribution Shape Comparison
The next plot compares the marginal distribution of renewal across the continuous scenarios. This is a quick visual reminder that datasets can share a graph while having different noise and functional assumptions.
```python
density_plot_df = pd.concat(
    [
        datasets[name].assign(dataset_name=name)[["dataset_name", "renewal"]]
        for name in continuous_dataset_names
    ],
    ignore_index=True,
)

fig, ax = plt.subplots(figsize=(11, 5))
sns.kdeplot(
    data=density_plot_df,
    x="renewal",
    hue="dataset_name",
    common_norm=False,
    linewidth=1.6,
    ax=ax,
)
ax.set_title("Renewal Distribution Across Continuous Synthetic Datasets")
ax.set_xlabel("Standardized renewal")
ax.set_ylabel("Density")
plt.tight_layout()

renewal_density_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_renewal_distribution_comparison.png"
fig.savefig(renewal_density_path, dpi=160, bbox_inches="tight")
plt.show()
```
The density curves show that the same variable can have different distributional behavior across scenarios. This matters because some discovery methods are designed for Gaussian data, while others rely on non-Gaussianity or nonlinear dependence.
Parent-Child Signal Checks
A good teaching dataset should contain detectable signal along true edges. This cell computes correlations for every true parent-child pair in the base DAG and compares them with a few non-edge pairs. Correlation is not a causal proof, but it is a useful generator sanity check.
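A minimal, self-contained version of that check, with a toy three-variable chain standing in for the base DAG (the notebook's own cell iterates over the real edge table instead):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Toy chain need -> match -> renewal standing in for the base DAG.
need = rng.normal(size=n)
match = 0.8 * need + rng.normal(size=n)
renewal = 0.8 * match + rng.normal(size=n)
data = pd.DataFrame({"need": need, "match": match, "renewal": renewal})

true_edges = [("need", "match"), ("match", "renewal")]
non_edges = [("need", "renewal")]  # no direct edge, but connected through match

rows = []
for parent, child in true_edges + non_edges:
    rows.append(
        {
            "pair": f"{parent} -> {child}",
            "is_true_edge": (parent, child) in true_edges,
            "correlation": data[parent].corr(data[child]),
        }
    )
signal_check = pd.DataFrame(rows)
print(signal_check)
```

The non-edge pair still shows a sizeable correlation through the indirect path, which is the point the surrounding text makes.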
True edges generally show clear association, but some selected non-edges can also be associated through indirect paths. This is exactly why causal discovery needs conditional-independence logic rather than simple pairwise correlation alone.
Correlation Heatmaps
The heatmaps give a compact view of pairwise dependence across scenarios. They are not graph estimates. They are diagnostic maps that help us see whether the generated data contain the broad dependence patterns implied by the structural equations.
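Such a heatmap cell reduces to a pairwise correlation matrix plus one plotting call. A minimal sketch on toy stand-in columns (the seaborn call is shown as a comment so the snippet runs headless):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000

# Toy dependent columns standing in for one scenario's variables.
match = rng.normal(size=n)
engagement = 0.7 * match + rng.normal(size=n)
renewal = 0.6 * engagement + rng.normal(size=n)
frame = pd.DataFrame({"match": match, "engagement": engagement, "renewal": renewal})

# The heatmap's input: pairwise Pearson correlations.
corr = frame.corr()
print(corr.round(2))

# In the notebook this matrix is rendered per scenario with something like:
# sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1)
```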
The heatmaps show strong dependence blocks around match, engagement, and downstream outcomes. The hidden-confounder version can look similar to the base data in pairwise correlations, which is why hidden confounding is hard to rule out from simple summaries.
Discrete Dataset Checks
For the discrete dataset, means are easier to read as rates or average ordinal levels. This cell reports value counts and rates so later notebooks know what class balance they are working with.
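A self-contained sketch of the value-count-and-rate idea, on toy binary and ordinal columns standing in for the real discrete_mixed dataset:

```python
import pandas as pd

# Toy discrete columns (hypothetical values, for illustration only).
discrete = pd.DataFrame(
    {
        "renewal": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],     # binary outcome
        "engagement": [0, 1, 2, 1, 2, 2, 1, 0, 2, 1],  # ordinal levels 0-2
    }
)

# One row per (variable, value): count and rate, so class balance is explicit.
rows = []
for column in discrete.columns:
    counts = discrete[column].value_counts().sort_index()
    for value, count in counts.items():
        rows.append(
            {
                "variable": column,
                "value": value,
                "count": int(count),
                "rate": count / len(discrete),
            }
        )
class_balance = pd.DataFrame(rows)
print(class_balance)
```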
The classes are not perfectly balanced, which is intentional. Discovery examples with discrete data should include realistic imbalance, but not so much imbalance that the teaching signal disappears.
Hidden-Confounder Diagnostics
Because we generated the hidden variable ourselves, we can check how strongly it relates to the observed variables it affects. A real dataset would not give us this luxury; that is exactly why this synthetic scenario is useful for teaching.
```python
hidden_diagnostics = pd.DataFrame(
    [
        {
            "relationship": "latent_demand with match",
            "correlation": hidden_confounder_full["latent_demand"].corr(hidden_confounder_full["match"]),
            "why_it_matters": "Latent demand partly drives observed match quality.",
        },
        {
            "relationship": "latent_demand with renewal",
            "correlation": hidden_confounder_full["latent_demand"].corr(hidden_confounder_full["renewal"]),
            "why_it_matters": "Latent demand also drives renewal, creating unobserved common-cause risk.",
        },
        {
            "relationship": "match with renewal in observed file",
            "correlation": hidden_confounder_observed["match"].corr(hidden_confounder_observed["renewal"]),
            "why_it_matters": "Observed association may mix directed, indirect, and hidden-confounding paths.",
        },
    ]
)
hidden_diagnostics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounder_diagnostics.csv", index=False)
hidden_diagnostics
```
| | relationship | correlation | why_it_matters |
|---|---|---|---|
| 0 | latent_demand with match | 0.508012 | Latent demand partly drives observed match quality. |
| 1 | latent_demand with renewal | 0.580324 | Latent demand also drives renewal, creating unobserved common-cause risk. |
| 2 | match with renewal in observed file | 0.733291 | Observed association may mix directed, indirect, and hidden-confounding paths. |
The latent variable is correlated with both match and renewal, as designed. Later, FCI-style methods can use the observed file to show why a PAG can be a more honest output than a forced fully observed DAG.
Nonstationarity Diagnostics
The nonstationary dataset should show environment-level differences. This cell summarizes means and standard deviations by environment, then plots the main mechanism shift we designed into the data.
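The per-environment summary amounts to a grouped aggregation. A minimal sketch on toy data, with assumed shift values (root mean 0.0 vs 1.0) standing in for the real generator's design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy nonstationary data: the root distribution and the intent -> match
# slope both change by environment (assumed values, for illustration).
frames = []
for env, (root_mean, slope) in enumerate([(0.0, 0.5), (1.0, 1.2)]):
    intent = rng.normal(loc=root_mean, size=1000)
    match = slope * intent + rng.normal(size=1000)
    frames.append(pd.DataFrame({"environment": env, "intent": intent, "match": match}))
nonstationary = pd.concat(frames, ignore_index=True)

# Per-environment means and standard deviations make the designed shift visible.
env_summary = nonstationary.groupby("environment")[["intent", "match"]].agg(["mean", "std"])
print(env_summary.round(2))
```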
The environment means differ because we intentionally changed root distributions and mechanism strength. That makes this dataset useful for tutorials about stationarity assumptions and environment-aware discovery.
Visualize The Environment Shift
This scatterplot shows intent against match by environment. The relationship is not identical across regimes, which is the mechanism shift built into the generator.
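The mechanism shift behind that plot can also be quantified: fitting the match-on-intent slope separately per environment should give clearly different coefficients, with the pooled fit landing in between. A sketch on toy data with assumed slopes (0.4 and 1.1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Toy regimes: the intent -> match slope changes across environments
# (the 0.4 / 1.1 values are assumptions for illustration).
frames = []
for env, slope in [(0, 0.4), (1, 1.1)]:
    intent = rng.normal(size=1500)
    match = slope * intent + 0.5 * rng.normal(size=1500)
    frames.append(pd.DataFrame({"environment": env, "intent": intent, "match": match}))
data = pd.concat(frames, ignore_index=True)

# Least-squares slope of match on intent, within each environment and pooled.
slopes = {}
for env, group in data.groupby("environment"):
    slope, intercept = np.polyfit(group["intent"], group["match"], deg=1)
    slopes[env] = slope
pooled_slope, _ = np.polyfit(data["intent"], data["match"], deg=1)
print({k: round(v, 2) for k, v in slopes.items()}, round(pooled_slope, 2))
```

The pooled slope summarizes both regimes at once, which is exactly what the dashed line in the scatterplot does visually.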
The pooled dashed line summarizes all environments at once, while the colored points show that the data are not generated from one perfectly stable regime. Later notebooks can use this to motivate environment-specific checks before trusting one pooled graph.
Scenario Comparison Table
The next table combines shape, type, and intended use into a compact catalog. It is the quickest reference for choosing the right synthetic dataset in later notebooks.
```python
scenario_catalog = (
    dataset_registry
    .merge(shape_check[["dataset_name", "rows", "columns", "total_missing_values"]], on="dataset_name", how="left")
    .merge(saved_file_table[["dataset_name", "data_path", "edge_path"]], on="dataset_name", how="left")
)

# Add the full hidden diagnostic file, which is intentionally not in the main registry.
hidden_full_row = saved_file_table[saved_file_table["dataset_name"].eq("hidden_confounder_full")].assign(
    row_count=N_ROWS,
    variable_type="continuous with latent column",
    main_stress_test="Diagnostic oracle file for the hidden-confounder scenario.",
    use_later_for="Diagnostics only; observed discovery should omit latent_demand.",
    known_limitation="Not a realistic observed discovery input.",
    total_missing_values=0,
)
scenario_catalog = pd.concat(
    [
        scenario_catalog,
        hidden_full_row[
            [
                "dataset_name",
                "row_count",
                "variable_type",
                "main_stress_test",
                "use_later_for",
                "known_limitation",
                "rows",
                "columns",
                "total_missing_values",
                "data_path",
                "edge_path",
            ]
        ],
    ],
    ignore_index=True,
)
scenario_catalog.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_scenario_catalog.csv", index=False)
scenario_catalog
```
| | dataset_name | row_count | variable_type | main_stress_test | use_later_for | known_limitation | rows | columns | total_missing_values | data_path | edge_path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linear_gaussian | 2500 | continuous | Linear additive mechanisms with Gaussian noise. | PC, Fisher-Z tests, GES, baseline graph recovery. | Purely observational data may not orient every Markov-equivalent edge. | 2500 | 6 | 0 | outputs/datasets/02_linear_gaussian.csv | outputs/tables/02_linear_gaussian_true_edges.csv |
| 1 | linear_nongaussian | 2500 | continuous | Linear mechanisms with non-Gaussian noise. | LiNGAM-style direction learning and non-Gaussian diagnostics. | Non-Gaussianity helps only when the linear model is a reasonable approximation. | … | … | … | … | … |
| … | … | … | … | … | … | … | … | … | … | … | … |

(remaining catalog rows truncated in this export; the full catalog is written to the CSV above)
The catalog makes downstream notebook choices explicit. For example, the PC notebook should start with linear_gaussian, while the FCI notebook should use hidden_confounder_observed and explain why the full hidden file is only for diagnostics.
Generated Artifact Manifest
The final cell lists all files generated by this notebook. This is a practical audit trail: if a later notebook cannot find a file, this manifest tells us whether it was created here and where it should live.
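A self-contained sketch of such a manifest cell; the temporary directory and toy CSV files here stand in for this notebook's real outputs tree:

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy output tree standing in for the notebook's outputs/ folder (illustration only).
root = Path(tempfile.mkdtemp())
(root / "datasets").mkdir()
(root / "tables").mkdir()
(root / "datasets" / "02_linear_gaussian.csv").write_text("a,b\n1,2\n")
(root / "tables" / "02_linear_gaussian_true_edges.csv").write_text("source,target\na,b\n")

# One manifest row per generated file: relative path plus size in bytes,
# so a missing downstream file is easy to trace back to its producer.
manifest = pd.DataFrame(
    [
        {"file": str(path.relative_to(root)), "size_bytes": path.stat().st_size}
        for path in sorted(root.rglob("*.csv"))
    ]
)
print(manifest)
```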
The synthetic data factory is now ready. The next tutorial can focus on conditional-independence tests because it can load known datasets from outputs/datasets and compare test behavior against documented ground truth.