causal-learn Tutorial 08: Score-Based Discovery With GES

The earlier causal-learn notebooks focused mostly on constraint-based discovery: PC, FCI, and CD-NOD search for graphs by asking many conditional independence questions. This notebook switches to a different family: score-based discovery.

Score-based discovery assigns each candidate graph a numerical score and searches for a graph that improves that score. The central algorithm here is GES, or Greedy Equivalence Search. GES does not test one conditional independence statement at a time. Instead, it searches over graph structures, adding and deleting edges when those operations improve a decomposable score such as BIC or BDeu.

The practical goals are to run GES with BIC and BDeu scores on matched and mismatched datasets, compare its output with a PC baseline on the same data, and examine how penalty strength and sample size shape the learned graph.

Runtime Note

This notebook is designed to execute quickly by default, usually in about 1-3 minutes on the current dataset and environment. It runs BIC-style GES and a small BDeu example.

causal-learn also exposes kernel/RKHS-style scores such as local_score_CV_general and local_score_marginal_general. Those can be useful for nonlinear settings, but on this 2,500-row teaching dataset they can take 5-15+ minutes depending on the machine and may emit numerical warnings. For a tutorial notebook, that is too much waiting for the main path, so those slower score families are discussed as optional extensions rather than executed by default.

Notebook Flow

We will work through score-based discovery in this order:

  1. Set up imports, paths, GES, PC, and reusable graph helpers.
  2. Load the synthetic datasets created earlier in the tutorial series.
  3. Explain GES, BIC, BDeu, and the forward/backward greedy search idea.
  4. Run BIC-GES on the linear Gaussian dataset.
  5. Compare GES with PC on the same data.
  6. Inspect GES update history, penalty sensitivity, and sample-size sensitivity.
  7. Stress-test BIC-GES on non-Gaussian and nonlinear data.
  8. Run BDeu-GES on the discrete teaching dataset.
  9. Save a reporting checklist and artifact manifest.

As in the previous notebooks, every code cell has explanation before it and a short discussion after it.

GES Theory

GES, or Greedy Equivalence Search, is a score-based causal discovery algorithm. Instead of asking many conditional independence questions directly, it searches for a graph that optimizes a numerical score.

The score typically rewards goodness of fit and penalizes complexity. For continuous linear Gaussian data, BIC-style scores are common. A graph with more parents can fit the data better, but the penalty discourages adding edges that do not improve the model enough.

GES is greedy: it follows local score-improving moves. This makes it much more scalable than exact search, but it also means the path through graph space matters.

Decomposable Scores

GES works efficiently because many graph scores are decomposable. A decomposable score can be written as a sum of local node scores:

\[ S(G) = \sum_i S\big(X_i, \mathrm{Pa}_G(X_i)\big) \]

This means the contribution for X_i depends only on X_i's own data and its parent set. If a proposed move changes only the parents of one node, the algorithm can update only that local score rather than refitting the entire graph from scratch.

Decomposability is one reason score-based discovery can be practical. It is also why parent-set assumptions and score choice matter so much.
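To make decomposability concrete, here is a minimal sketch (toy data and a simplified Gaussian local score, not causal-learn's internals): the graph score is literally a sum of per-node local scores, so a move that changes one node's parents only requires rescoring that node.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] += 0.8 * X[:, 0]   # column 2 truly depends on column 0

def local_bic(data, i, parents):
    """Simplified Gaussian local BIC-style score for node i given its parents."""
    n = data.shape[0]
    y = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        y = y - Z @ beta
    sigma2 = max(y.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

def graph_score(data, parent_sets):
    """Decomposable score: the graph score is a sum of local node scores."""
    return sum(local_bic(data, i, pa) for i, pa in parent_sets.items())

empty = {0: [], 1: [], 2: []}
with_edge = {0: [], 1: [], 2: [0]}   # adds 0 -> 2; only node 2's parents change
delta_local = local_bic(X, 2, [0]) - local_bic(X, 2, [])
delta_graph = graph_score(X, with_edge) - graph_score(X, empty)
print(np.isclose(delta_local, delta_graph))   # the move's gain is purely local
```

Because the other nodes' local scores cancel exactly, evaluating a single-edge move costs one local regression, not a full refit.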

BIC And Complexity Penalties

BIC balances model fit against model complexity. In causal discovery, adding an edge usually adds a parameter because the child variable gets another parent. More parameters can improve fit, but they can also overfit sample noise.

A BIC-style score therefore asks: does the improvement in likelihood justify the extra complexity? If yes, the edge may be added. If no, the penalty wins and the simpler graph is preferred.

The penalty strength is not just a technicality. Stronger penalties produce sparser graphs; weaker penalties allow denser graphs. This is why the GES notebook includes penalty sensitivity.
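A toy illustration of penalty strength (hypothetical numbers and a simplified score, not the notebook's implementation): the same weak edge can survive a mild penalty and be rejected by a strong one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.12 * x + rng.normal(size=n)   # weak but real effect

def local_score(target, parent, lam):
    """Toy Gaussian local score with a tunable complexity penalty lam."""
    m = len(target)
    if parent is None:
        resid = target
        n_params = 1
    else:
        beta = parent @ target / (parent @ parent)   # no-intercept least squares
        resid = target - beta * parent
        n_params = 2
    sigma2 = max(resid.var(), 1e-12)
    return -0.5 * m * (1 + np.log(sigma2)) - lam * n_params * np.log(m)

for lam in (0.5, 5.0):
    gain = local_score(y, x, lam) - local_score(y, None, lam)
    verdict = "keeps" if gain > 0 else "rejects"
    print(f"lambda={lam}: the search {verdict} the weak edge (gain {gain:+.1f})")
```

The likelihood improvement from the weak parent is fixed by the data; whether it outweighs the penalty depends entirely on the chosen lambda, which is exactly what the penalty-sensitivity section probes.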

Equivalence Classes Rather Than Single DAGs

GES searches over Markov equivalence classes rather than treating every DAG orientation as uniquely identifiable. Multiple DAGs can imply the same conditional independence structure and receive equivalent scores under common assumptions.

The output is often represented as a CPDAG-like graph. Directed edges are compelled across the equivalence class; undirected edges are not uniquely oriented by the score and assumptions.

This is conceptually similar to PC’s CPDAG output. Both methods can leave edges unoriented when observational information does not compel a direction.
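A small demonstration of score equivalence (a toy sketch, not causal-learn code): for a dependent Gaussian pair, the DAGs X -> Y and Y -> X decompose the same joint likelihood with the same parameter count, so a Gaussian BIC-style score cannot distinguish them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)
data = np.column_stack([x, y])
data -= data.mean(axis=0)   # center so no-intercept regressions are exact

def local_bic(data, i, parents):
    """Simplified Gaussian local BIC-style score for node i given its parents."""
    n = data.shape[0]
    target = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        target = target - Z @ beta
    sigma2 = max(target.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

score_x_to_y = local_bic(data, 0, []) + local_bic(data, 1, [0])   # X -> Y
score_y_to_x = local_bic(data, 1, []) + local_bic(data, 0, [1])   # Y -> X
print(np.isclose(score_x_to_y, score_y_to_x))   # same equivalence class, same score
```

This is why the output is a CPDAG-like object: the score alone gives GES no reason to prefer one orientation of this adjacency over the other.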

Forward And Backward Phases

GES usually has two main phases.

The forward phase starts from an empty graph and greedily adds edges or edge patterns that improve the score. This phase increases model complexity when the fit improvement is worthwhile.

The backward phase starts from the forward result and greedily deletes edges when removal improves the score. This phase corrects some over-addition from the forward phase and searches for a better sparse explanation.

Some implementations also include turning or refinement steps. The key idea remains the same: local moves are accepted when they improve the score.
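The two phases can be sketched as a generic greedy loop. This is a deliberately simplified single-edge version over DAGs (real GES applies insert/delete operators on equivalence classes, and the local score here is a toy Gaussian BIC, not causal-learn's):

```python
import numpy as np
from itertools import permutations

def local_bic(data, i, parents):
    """Toy Gaussian local score: penalized log-likelihood of node i given its parents."""
    n = data.shape[0]
    y = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        y = y - Z @ beta
    sigma2 = max(y.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

def greedy_edge_search(data):
    """Forward phase adds the best edge; backward phase deletes the best edge."""
    p = data.shape[1]
    parents = {i: set() for i in range(p)}

    def creates_cycle(src, dst):
        # Adding src -> dst creates a cycle iff dst is already an ancestor of src.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    def best_move(candidates):
        # candidates: (dst, proposed_parent_set); keep the largest positive gain.
        best = None
        for dst, new_pa in candidates:
            gain = local_bic(data, dst, sorted(new_pa)) - local_bic(data, dst, sorted(parents[dst]))
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, dst, new_pa)
        return best

    # Forward phase: repeatedly accept the single best score-improving insertion.
    while True:
        move = best_move(
            (d, parents[d] | {s})
            for s, d in permutations(range(p), 2)
            if s not in parents[d] and not creates_cycle(s, d)
        )
        if move is None:
            break
        parents[move[1]] = move[2]

    # Backward phase: repeatedly accept the single best score-improving deletion.
    while True:
        move = best_move((d, parents[d] - {s}) for d in range(p) for s in list(parents[d]))
        if move is None:
            break
        parents[move[1]] = move[2]
    return parents

# Demo on a three-variable chain x -> y -> z; the chain skeleton should come back.
rng = np.random.default_rng(3)
x = rng.normal(size=800)
y = 0.9 * x + rng.normal(size=800)
z = 0.9 * y + rng.normal(size=800)
result = greedy_edge_search(np.column_stack([x, y, z]))
skeleton = {frozenset((s, d)) for d, pa in result.items() for s in pa}
print(skeleton == {frozenset((0, 1)), frozenset((1, 2))})
```

Because the chain's two orientations score identically, which direction the toy search commits to depends on tie-breaking, but the recovered adjacencies are stable. That mirrors the equivalence-class discussion above.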

Score Assumptions And Model Mismatch

A score is not neutral. A linear Gaussian BIC score is most appropriate when the data are reasonably modeled by linear relationships with Gaussian residuals. If the true mechanisms are nonlinear, discrete, heavy-tailed, or heteroskedastic, the score may prefer a graph that is statistically convenient rather than causally correct.

This is why a score-based workflow should include mechanism checks and alternative scores when appropriate. For discrete data, a BDeu-style score may be more natural than a continuous BIC score.

The graph should be interpreted as the best explanation under the chosen score, penalty, and search procedure, not as a score-free causal truth.
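A one-line illustration of mismatch (a toy example, not the notebook's dataset): a purely quadratic mechanism produces near-zero correlation, so a linear Gaussian score sees almost no reason to add the edge even though the dependence is strong.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = x ** 2 + 0.1 * rng.normal(size=5000)   # strong but nonlinear dependence

# Linear signal is nearly absent, so a linear Gaussian score gains little
# from the true edge x -> y; the squared feature exposes the dependence.
print(abs(np.corrcoef(x, y)[0, 1]) < 0.1)
print(abs(np.corrcoef(x ** 2, y)[0, 1]) > 0.9)
```

This is the kind of failure the nonlinear stress-test section later makes visible on the actual teaching data.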

What GES Can And Cannot Claim

GES can efficiently search large graph spaces and often performs well when its score assumptions match the data. It is useful as a complement to PC because it approaches discovery through model fit rather than direct independence testing.

GES cannot guarantee a global optimum. It is greedy, so local decisions can affect the final graph. It also cannot identify directions that are not identifiable under the observational equivalence class without additional assumptions.

A good GES report documents the score, penalty, data type, search settings, sample-size sensitivity, and any comparison to constraint-based or exact-search results.

Setup

This setup cell imports the scientific stack, GES, PC, and plotting utilities. It also applies a small local compatibility wrapper for causal-learn’s BIC score in this Python/NumPy environment. The wrapper fixes scalar extraction from a one-element matrix and does not edit the installed package on disk.

from pathlib import Path
from importlib.metadata import PackageNotFoundError, version
import contextlib
import io
import os
import warnings

# Keep matplotlib cache writes inside the repository so notebook execution works in restricted environments.
os.environ.setdefault("MPLCONFIGDIR", str(Path.cwd() / ".matplotlib_cache"))
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message=".*pkg_resources is deprecated.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

import causallearn.search.ScoreBased.GES as ges_module
from causallearn.search.ScoreBased.GES import ges
from causallearn.search.ConstraintBased.PC import pc

# Resolve paths whether the notebook is run from the repository root or from this notebook folder.
CWD = Path.cwd()
if CWD.name == "causal_learn" and (CWD / "outputs").exists():
    NOTEBOOK_DIR = CWD
else:
    NOTEBOOK_DIR = (CWD / "notebooks" / "tutorials" / "causal_learn").resolve()

OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR = OUTPUT_DIR / "figures"
for directory in [OUTPUT_DIR, DATASET_DIR, TABLE_DIR, FIGURE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

NOTEBOOK_PREFIX = "08"
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.dpi"] = 120
plt.rcParams["savefig.facecolor"] = "white"


def local_score_BIC_from_cov(Data, i, PAi, parameters=None):
    """Compatibility wrapper for causal-learn BIC score under recent NumPy scalar behavior."""
    cov, n = Data
    lambda_value = 0.5 if parameters is None else parameters.get("lambda_value", 0.5)
    sigma = float(cov[i, i])
    if len(PAi) > 0:
        yX = cov[np.ix_([i], PAi)]
        XX = cov[np.ix_(PAi, PAi)]
        try:
            XX_inv = np.linalg.inv(XX)
        except np.linalg.LinAlgError:
            XX_inv = np.linalg.pinv(XX)
        sigma = float(cov[i, i] - (yX @ XX_inv @ yX.T).item())
    sigma = max(sigma, 1e-12)
    return -0.5 * n * (1 + np.log(sigma)) - lambda_value * (len(PAi) + 1) * np.log(n)

# Patch only the imported module object used during this notebook session.
ges_module.local_score_BIC_from_cov = local_score_BIC_from_cov

packages = ["causal-learn", "numpy", "pandas", "matplotlib", "seaborn"]
version_rows = []
for package in packages:
    try:
        package_version = version(package)
    except PackageNotFoundError:
        package_version = "not installed"
    version_rows.append({"package": package, "version": package_version})

package_versions = pd.DataFrame(version_rows)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 causal-learn 0.1.4.5
1 numpy 2.4.4
2 pandas 3.0.2
3 matplotlib 3.10.9
4 seaborn 0.13.2

The version table and compatibility wrapper make the notebook reproducible in this environment. The wrapper preserves the same BIC formula; it only converts a one-element NumPy result into a Python scalar before scoring.
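The scalar-extraction step the wrapper handles can be seen in a standalone sketch. This uses a hand-built covariance matrix and sample size to mirror the (cov, n) input the local score receives; the numbers here are illustrative only.

```python
import numpy as np

# Toy covariance for three variables plus a sample size, mimicking (cov, n).
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
n = 1000

i, PAi = 2, [0, 1]
yX = cov[np.ix_([i], PAi)]                      # 1 x k row of cross-covariances
XX_inv = np.linalg.inv(cov[np.ix_(PAi, PAi)])   # k x k parent covariance inverse
quad = yX @ XX_inv @ yX.T                       # a 1 x 1 matrix, not a scalar
residual_variance = float(cov[i, i] - quad.item())   # .item() extracts the scalar
score = -0.5 * n * (1 + np.log(residual_variance)) - 0.5 * (len(PAi) + 1) * np.log(n)
print(isinstance(score, float))   # the local score comes out as a plain Python float
```

Calling `.item()` on the one-element matrix is what keeps this robust: converting a size-1 multidimensional array with `float()` directly is deprecated in recent NumPy and may raise, which is the behavior the wrapper works around.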

Load Teaching Datasets

We use four datasets from the synthetic data factory:

  • 02_linear_gaussian.csv: the best match for BIC-GES because the score assumes linear Gaussian local regressions;
  • 02_linear_nongaussian.csv: same broad graph family but with non-Gaussian noise;
  • 02_nonlinear_continuous.csv: a deliberate mismatch for linear Gaussian BIC;
  • 02_discrete_mixed.csv: a small discrete example for BDeu-GES.

The true edge tables are loaded only for tutorial evaluation. Real causal discovery work would not know the true graph.

# Load datasets and true edge lists created by tutorial notebook 02.
dataset_paths = {
    "linear_gaussian": DATASET_DIR / "02_linear_gaussian.csv",
    "linear_nongaussian": DATASET_DIR / "02_linear_nongaussian.csv",
    "nonlinear_continuous": DATASET_DIR / "02_nonlinear_continuous.csv",
    "discrete_mixed": DATASET_DIR / "02_discrete_mixed.csv",
}
true_edge_path = TABLE_DIR / "02_base_true_dag_edges.csv"
discrete_true_edge_path = TABLE_DIR / "02_discrete_mixed_true_edges.csv"

required_paths = list(dataset_paths.values()) + [true_edge_path, discrete_true_edge_path]
missing_paths = [str(path) for path in required_paths if not path.exists()]
if missing_paths:
    raise FileNotFoundError("Run tutorial notebook 02 first. Missing files: " + ", ".join(missing_paths))

datasets = {name: pd.read_csv(path) for name, path in dataset_paths.items()}
true_edges = pd.read_csv(true_edge_path)
discrete_true_edges = pd.read_csv(discrete_true_edge_path)
VARIABLES = list(datasets["linear_gaussian"].columns)

loaded_summary = pd.DataFrame(
    [
        {
            "dataset": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
            "source_file": dataset_paths[name].name,
        }
        for name, df in datasets.items()
    ]
)
loaded_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_loaded_dataset_summary.csv", index=False)

display(loaded_summary)
display(datasets["linear_gaussian"].head())
display(true_edges)
dataset rows columns missing_cells source_file
0 linear_gaussian 2500 6 0 02_linear_gaussian.csv
1 linear_nongaussian 2500 6 0 02_linear_nongaussian.csv
2 nonlinear_continuous 2500 6 0 02_nonlinear_continuous.csv
3 discrete_mixed 2500 6 0 02_discrete_mixed.csv
need intent match engagement renewal support
0 0.249820 -0.372094 0.060245 0.667197 0.252766 -0.280058
1 0.683671 -0.210471 0.904969 1.004727 0.320095 -0.332215
2 -0.579752 -1.202671 -0.578579 -0.235444 -0.732431 0.594102
3 -0.902823 -0.077309 -0.771219 -0.531128 -0.105721 -1.503551
4 -1.985745 0.087297 -0.691315 -1.281731 -0.797906 -0.328219
source target edge_type mechanism
0 need match directed Need changes what a good match means.
1 intent match directed Current intent changes recommendation relevance.
2 match engagement directed Better matching increases engagement depth.
3 intent renewal directed Intent directly affects later value.
4 engagement renewal directed Engagement contributes to renewal value.
5 engagement support directed Engagement creates more chances for support co...

All datasets use the same six observed variables, which makes graph comparisons straightforward. The first run will use linear_gaussian because BIC-GES is best aligned with that data-generating setup.

GES Concept Map

GES searches over equivalence classes of DAGs rather than testing individual conditional independence statements. It usually has two major stages: a forward stage that greedily inserts edges to improve the score, and a backward stage that deletes edges if doing so improves the score.

The output should be read as a candidate equivalence-class graph. Some directions may be compelled, while other adjacencies may remain unoriented because several DAGs can have the same score under the assumptions.

# Summarize the key score-based discovery concepts used in this notebook.
ges_concepts = pd.DataFrame(
    [
        {
            "concept": "score-based discovery",
            "plain_language": "Search for a graph that optimizes a numerical fit-plus-penalty score.",
            "why_it_matters": "The result depends on score assumptions and search choices, not only independence tests.",
        },
        {
            "concept": "GES forward phase",
            "plain_language": "Greedily insert edges or operators that improve the graph score.",
            "why_it_matters": "Early additions can shape the graph neighborhood explored later.",
        },
        {
            "concept": "GES backward phase",
            "plain_language": "Delete edges or operators when deletion improves the score.",
            "why_it_matters": "The backward phase can simplify an overly dense forward graph.",
        },
        {
            "concept": "BIC score",
            "plain_language": "Fit a linear Gaussian local model and penalize parent-set complexity.",
            "why_it_matters": "Good match for linear Gaussian data; weaker match for nonlinear mechanisms.",
        },
        {
            "concept": "BDeu score",
            "plain_language": "Score discrete Bayesian-network structures with Dirichlet priors.",
            "why_it_matters": "Useful for categorical data when states and priors are documented.",
        },
    ]
)
ges_concepts.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_concept_map.csv", index=False)
display(ges_concepts)
concept plain_language why_it_matters
0 score-based discovery Search for a graph that optimizes a numerical ... The result depends on score assumptions and se...
1 GES forward phase Greedily insert edges or operators that improv... Early additions can shape the graph neighborho...
2 GES backward phase Delete edges or operators when deletion improv... The backward phase can simplify an overly dens...
3 BIC score Fit a linear Gaussian local model and penalize... Good match for linear Gaussian data; weaker ma...
4 BDeu score Score discrete Bayesian-network structures wit... Useful for categorical data when states and pr...

The concept map sets up the rest of the notebook. The key shift from PC is that we are now asking whether a graph scores well under a modeling assumption, not whether a specific conditional independence test rejects or fails to reject.

Helper Functions

This helper cell defines graph conversion, metric, update-history, and plotting utilities. The graph helpers are similar to earlier notebooks so the visual language stays consistent across the tutorial series.

def parse_causallearn_edge(edge):
    """Convert a causal-learn edge object into source, endpoint pattern, and target strings."""
    parts = str(edge).strip().split()
    if len(parts) != 3:
        return {"source": str(edge), "edge_type": "unknown", "target": "unknown"}
    return {"source": parts[0], "edge_type": parts[1], "target": parts[2]}


def graph_to_edge_table(graph, label):
    """Return a tidy edge table from a causal-learn graph object."""
    rows = [parse_causallearn_edge(edge) for edge in graph.get_graph_edges()]
    edge_df = pd.DataFrame(rows, columns=["source", "edge_type", "target"])
    if edge_df.empty:
        edge_df = pd.DataFrame(columns=["source", "edge_type", "target"])
    edge_df.insert(0, "run", label)
    return edge_df


def directed_pairs(edge_df):
    """Extract definite directed pairs from an edge table."""
    pairs = set()
    for row in edge_df.itertuples(index=False):
        if row.edge_type == "-->":
            pairs.add((row.source, row.target))
        elif row.edge_type == "<--":
            pairs.add((row.target, row.source))
    return pairs


def skeleton_pairs(edge_df):
    """Extract learned adjacencies while ignoring endpoint direction."""
    pairs = set()
    for row in edge_df.itertuples(index=False):
        if row.target != "unknown":
            pairs.add(frozenset([row.source, row.target]))
    return pairs


def summarize_against_truth(edge_df, truth_df, label):
    """Compute compact graph-recovery metrics against the synthetic truth table."""
    true_directed = set(zip(truth_df["source"], truth_df["target"]))
    true_skeleton = {frozenset(edge) for edge in true_directed}
    learned_directed = directed_pairs(edge_df)
    learned_skeleton = skeleton_pairs(edge_df)

    correct_directed = learned_directed & true_directed
    reversed_true = {(src, dst) for src, dst in true_directed if (dst, src) in learned_directed}
    missing_skeleton = true_skeleton - learned_skeleton
    extra_skeleton = learned_skeleton - true_skeleton

    unresolved_true = 0
    for src, dst in true_directed:
        pair = frozenset([src, dst])
        if pair in learned_skeleton and (src, dst) not in learned_directed and (dst, src) not in learned_directed:
            unresolved_true += 1

    directed_count = len(learned_directed)
    return pd.DataFrame(
        [
            {
                "run": label,
                "learned_edges_total": len(edge_df),
                "definite_directed_edges": directed_count,
                "true_edges": len(true_directed),
                "correct_directed_edges": len(correct_directed),
                "directed_precision": len(correct_directed) / directed_count if directed_count else np.nan,
                "directed_recall": len(correct_directed) / len(true_directed) if true_directed else np.nan,
                "reversed_true_edges": len(reversed_true),
                "unresolved_true_adjacencies": unresolved_true,
                "missing_true_adjacencies": len(missing_skeleton),
                "extra_adjacencies": len(extra_skeleton),
            }
        ]
    )


def classify_edges(edge_df, truth_df):
    """Label learned edges relative to the synthetic truth table."""
    true_directed = set(zip(truth_df["source"], truth_df["target"]))
    true_skeleton = {frozenset(edge) for edge in true_directed}
    rows = []
    for row in edge_df.itertuples(index=False):
        pair = frozenset([row.source, row.target])
        learned_direction = None
        if row.edge_type == "-->":
            learned_direction = (row.source, row.target)
        elif row.edge_type == "<--":
            learned_direction = (row.target, row.source)

        if learned_direction in true_directed:
            status = "correct directed edge"
        elif learned_direction and (learned_direction[1], learned_direction[0]) in true_directed:
            status = "reversed true edge"
        elif pair in true_skeleton:
            status = "true adjacency with uncertain or wrong endpoint"
        else:
            status = "extra adjacency"
        rows.append({"source": row.source, "edge_type": row.edge_type, "target": row.target, "status": status})
    return pd.DataFrame(rows)


def run_ges_bic(data_df, label, lambda_value=0.5, maxP=None):
    """Run BIC-GES quietly and return the record plus tidy edge table."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        record = ges(
            data_df.to_numpy(),
            score_func="local_score_BIC",
            node_names=list(data_df.columns),
            lambda_value=lambda_value,
            maxP=maxP,
        )
    edge_table = graph_to_edge_table(record["G"], label=label)
    messages = pd.DataFrame({"run": label, "message": [line for line in buffer.getvalue().splitlines() if line.strip()]})
    return record, edge_table, messages


def run_pc_baseline(data_df, label):
    """Run Fisher-Z PC as a constraint-based baseline."""
    result = pc(
        data_df.to_numpy(),
        alpha=0.05,
        indep_test="fisherz",
        stable=True,
        show_progress=False,
        node_names=list(data_df.columns),
    )
    return graph_to_edge_table(result.G, label=label)


def operation_table(record, variable_names):
    """Summarize GES forward and backward updates using variable names."""
    rows = []
    for phase, updates in [("forward_insert", record.get("update1", [])), ("backward_delete", record.get("update2", []))]:
        for step, update in enumerate(updates, start=1):
            source_idx, target_idx, conditioning_set = update
            rows.append(
                {
                    "phase": phase,
                    "step": step,
                    "source_index": int(source_idx),
                    "target_index": int(target_idx),
                    "source_variable": variable_names[int(source_idx)],
                    "target_variable": variable_names[int(target_idx)],
                    "conditioning_set": [variable_names[int(idx)] for idx in conditioning_set],
                }
            )
    return pd.DataFrame(rows)


GRAPH_POSITIONS = {
    "need": (0.11, 0.72),
    "intent": (0.11, 0.28),
    "match": (0.39, 0.50),
    "engagement": (0.62, 0.50),
    "renewal": (0.89, 0.72),
    "support": (0.89, 0.28),
}
NODE_LABELS = {name: name.title() for name in GRAPH_POSITIONS}
NODE_COLORS = {
    "need": "#e0f2fe",
    "intent": "#dbeafe",
    "match": "#ecfccb",
    "engagement": "#fef3c7",
    "renewal": "#fee2e2",
    "support": "#f3e8ff",
}


def trim_edge_to_box(start, end, box_w=0.145, box_h=0.095, gap=0.012):
    """Return edge endpoints that stop just outside source and target boxes."""
    x0, y0 = start
    x1, y1 = end
    dx = x1 - x0
    dy = y1 - y0
    length = float(np.hypot(dx, dy))
    if length == 0:
        return start, end
    effective_w = box_w + 0.04
    effective_h = box_h + 0.04
    x_limit = (effective_w / 2) / abs(dx) if dx else np.inf
    y_limit = (effective_h / 2) / abs(dy) if dy else np.inf
    t = min(x_limit, y_limit) + gap / length
    return (x0 + dx * t, y0 + dy * t), (x1 - dx * t, y1 - dy * t)


def draw_box_graph(edge_df, title, path, note=None):
    """Draw a DAG/CPDAG-style graph with rounded boxes and visible arrowheads."""
    fig, ax = plt.subplots(figsize=(12, 6.2))
    ax.set_axis_off()
    ax.set_xlim(-0.02, 1.02)
    ax.set_ylim(0.04, 0.96)
    box_w, box_h = 0.145, 0.095

    for row in edge_df.itertuples(index=False):
        if row.source not in GRAPH_POSITIONS or row.target not in GRAPH_POSITIONS:
            continue
        raw_start = GRAPH_POSITIONS[row.source]
        raw_end = GRAPH_POSITIONS[row.target]
        if row.edge_type == "<--":
            raw_start, raw_end = raw_end, raw_start
        start, end = trim_edge_to_box(raw_start, raw_end, box_w=box_w, box_h=box_h)
        if row.edge_type in {"-->", "<--"}:
            arrowstyle = "-|>"
            mutation_scale = 18
            linewidth = 1.8
            color = "#334155"
        else:
            arrowstyle = "-"
            mutation_scale = 1
            linewidth = 1.5
            color = "#64748b"
        arrow = FancyArrowPatch(
            start,
            end,
            arrowstyle=arrowstyle,
            mutation_scale=mutation_scale,
            linewidth=linewidth,
            color=color,
            connectionstyle="arc3,rad=0.035",
            zorder=2,
        )
        ax.add_patch(arrow)

    for node, (x, y) in GRAPH_POSITIONS.items():
        rect = FancyBboxPatch(
            (x - box_w / 2, y - box_h / 2),
            box_w,
            box_h,
            boxstyle="round,pad=0.018",
            facecolor=NODE_COLORS[node],
            edgecolor="#1f2937",
            linewidth=1.1,
            zorder=5,
        )
        ax.add_patch(rect)
        ax.text(x, y, NODE_LABELS[node], ha="center", va="center", fontsize=10.5, fontweight="bold", zorder=6)

    if note:
        ax.text(0.50, 0.08, note, ha="center", va="center", fontsize=10, color="#475569")
    ax.set_title(title, pad=18, fontsize=14, fontweight="bold")
    fig.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()


def truth_as_edge_table(truth_df, label="truth"):
    """Convert a truth edge table to the plotting schema."""
    return truth_df.assign(run=label, edge_type="-->")[["run", "source", "edge_type", "target"]]

The helpers let us compare GES, PC, BIC penalties, sample sizes, and data-generating regimes using the same edge-table and graph-metric vocabulary.

Draw The Reference Graph

The reference graph is the synthetic data-generating structure. It is included for teaching evaluation, not because real causal discovery work would know the answer in advance.

# Draw the reference teaching graph.
true_edge_table = truth_as_edge_table(true_edges, label="true_graph")
true_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_true_dag.png"
draw_box_graph(
    true_edge_table,
    title="Reference Teaching DAG",
    path=true_graph_path,
    note="This graph is known only because the data are synthetic; it is used to evaluate discovery behavior.",
)

The reference graph gives us a visual target for the score-based runs. BIC-GES should perform best on the linear Gaussian dataset because that is the closest match to the score assumptions.

Continuous Data Audit

Before running a linear Gaussian score, we audit the linear Gaussian dataset. The BIC score uses covariance and local linear regressions, so the correlation structure is especially relevant.

# Summarize the continuous baseline data and save its correlation matrix.
linear_df = datasets["linear_gaussian"]
linear_audit = pd.DataFrame(
    {
        "variable": VARIABLES,
        "missing_rate": linear_df.isna().mean().reindex(VARIABLES).to_numpy(),
        "mean": linear_df.mean().reindex(VARIABLES).to_numpy(),
        "std": linear_df.std().reindex(VARIABLES).to_numpy(),
        "min": linear_df.min().reindex(VARIABLES).to_numpy(),
        "max": linear_df.max().reindex(VARIABLES).to_numpy(),
    }
)
linear_corr = linear_df.corr()

linear_audit.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_data_audit.csv", index=False)
linear_corr.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_correlation_matrix.csv")

display(linear_audit)
display(linear_corr.round(3))
variable missing_rate mean std min max
0 need 0.0 3.552714e-18 1.0002 -3.371900 3.316358
1 intent 0.0 -9.947598e-18 1.0002 -3.467962 3.389982
2 match 0.0 -3.552714e-19 1.0002 -3.596272 3.917631
3 engagement 0.0 -1.776357e-18 1.0002 -3.404627 3.469112
4 renewal 0.0 -8.526513e-18 1.0002 -3.418065 3.122549
5 support 0.0 1.563194e-17 1.0002 -3.250120 3.174916
need intent match engagement renewal support
need 1.000 0.007 0.595 0.494 0.211 0.311
intent 0.007 1.000 0.638 0.529 0.735 0.307
match 0.595 0.638 1.000 0.821 0.665 0.489
engagement 0.494 0.529 0.821 1.000 0.670 0.582
renewal 0.211 0.735 0.665 0.670 1.000 0.384
support 0.311 0.307 0.489 0.582 0.384 1.000

The data are complete and centered enough for a clean tutorial example. The correlation matrix shows dependence structure, but GES will use a graph score rather than treating pairwise correlations as graph edges.

Plot The Correlation Matrix

The heatmap gives a visual sense of the dependence structure that the score-based search will try to explain through parent sets.

# Plot the baseline correlation matrix.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(linear_corr, annot=True, fmt=".2f", cmap="vlag", center=0, square=True, ax=ax)
ax.set_title("Linear Gaussian Correlation Matrix")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_correlation_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

The heatmap is only a diagnostic. The next cell runs GES and asks which graph scores best under the local linear Gaussian BIC objective.

Baseline BIC-GES On Linear Gaussian Data

This is the main GES run. We use the standard BIC penalty coefficient lambda_value=0.5, which corresponds to the usual BIC-style complexity penalty in causal-learn’s implementation.

# Run BIC-GES on the linear Gaussian dataset.
ges_record, ges_edges, ges_messages = run_ges_bic(
    linear_df,
    label="bic_ges_linear_gaussian",
    lambda_value=0.5,
)
ges_metrics = summarize_against_truth(ges_edges, true_edges, "bic_ges_linear_gaussian")
ges_classified = classify_edges(ges_edges, true_edges)

baseline_score_summary = pd.DataFrame(
    [
        {
            "run": "bic_ges_linear_gaussian",
            "score": ges_record["score"],
            "forward_updates": len(ges_record.get("update1", [])),
            "backward_updates": len(ges_record.get("update2", [])),
            "lambda_value": 0.5,
        }
    ]
)

baseline_score_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_score_summary.csv", index=False)
ges_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_edges.csv", index=False)
ges_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_metrics.csv", index=False)
ges_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_edge_classification.csv", index=False)
ges_messages.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_messages.csv", index=False)

display(baseline_score_summary)
display(ges_edges)
display(ges_metrics)
display(ges_classified)
run score forward_updates backward_updates lambda_value
0 bic_ges_linear_gaussian -2547.995676 8 2 0.5
run source edge_type target
0 bic_ges_linear_gaussian need --> match
1 bic_ges_linear_gaussian intent --> match
2 bic_ges_linear_gaussian intent --> renewal
3 bic_ges_linear_gaussian match --> engagement
4 bic_ges_linear_gaussian engagement --> renewal
5 bic_ges_linear_gaussian engagement --> support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bic_ges_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0
source edge_type target status
0 need --> match correct directed edge
1 intent --> match correct directed edge
2 intent --> renewal correct directed edge
3 match --> engagement correct directed edge
4 engagement --> renewal correct directed edge
5 engagement --> support correct directed edge

On the linear Gaussian teaching data, BIC-GES recovers the intended graph cleanly. This is the friendly case: the data-generating assumptions line up well with the score.

Draw The Baseline GES Graph

The graph below is the baseline score-based discovery result. It should look familiar from the PC notebook because this synthetic dataset was intentionally easy for both Fisher-Z PC and BIC-GES.

# Draw and save the baseline GES graph.
baseline_ges_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_graph.png"
draw_box_graph(
    ges_edges,
    title="BIC-GES On Linear Gaussian Data",
    path=baseline_ges_graph_path,
    note="In this friendly setting, the score assumptions align with the data-generating process.",
)

The graph is clean because the example is well matched to BIC-GES. The next section opens the search record so students can see that GES reached this graph through forward and backward operations.

GES Update History

GES runs in two greedy phases: a forward phase that inserts edges while the score improves, then a backward phase that deletes edges while the score improves. causal-learn records both phases, but the raw update objects are index-based, so this cell maps them back to variable names.

# Summarize the forward and backward GES update history.
ges_updates = operation_table(ges_record, VARIABLES)
ges_updates.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_update_history.csv", index=False)
display(ges_updates)
phase step source_index target_index source_variable target_variable conditioning_set
0 forward_insert 1 2 3 match engagement []
1 forward_insert 2 4 1 renewal intent []
2 forward_insert 3 4 3 renewal engagement []
3 forward_insert 4 2 0 match need []
4 forward_insert 5 1 0 intent need [match]
5 forward_insert 6 3 5 engagement support []
6 forward_insert 7 1 2 intent match [engagement]
7 forward_insert 8 1 3 intent engagement []
8 backward_delete 1 0 1 need intent [match]
9 backward_delete 2 1 3 intent engagement []

The update table shows the search path rather than just the final graph. This is useful when debugging a score-based result because two final graphs can have similar scores but different search trajectories.
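The index-to-name mapping itself is simple. A minimal sketch, using a hypothetical tuple layout for the raw updates (causal-learn's actual update objects are richer classes), looks like:

```python
# Hypothetical layout for one raw update: (source_index, target_index, conditioning_indices).
# This only illustrates the index-to-name step, not causal-learn's real data structures.
variables = ["need", "intent", "match", "engagement", "renewal", "support"]
raw_updates = [(2, 3, []), (1, 0, [2])]

def name_update(update, names):
    src, dst, cond = update
    return {
        "source": names[src],
        "target": names[dst],
        "conditioning_set": [names[i] for i in cond],
    }

named = [name_update(u, variables) for u in raw_updates]
print(named)
```

The real `operation_table` helper does the same translation for every recorded insert and delete, which is what produces the readable table above.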

Compare GES With PC

Now we run PC on the same linear Gaussian dataset. This is not a contest to crown one method permanently better. It is a controlled comparison between constraint-based and score-based discovery under friendly assumptions.

# Run PC on the same linear Gaussian data and compare with GES.
pc_edges = run_pc_baseline(linear_df, label="pc_fisherz_linear_gaussian")
pc_metrics = summarize_against_truth(pc_edges, true_edges, "pc_fisherz_linear_gaussian")
pc_classified = classify_edges(pc_edges, true_edges)

pc_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_edges.csv", index=False)
pc_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_metrics.csv", index=False)
pc_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_edge_classification.csv", index=False)

method_comparison = pd.concat([ges_metrics, pc_metrics], ignore_index=True)
method_comparison.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_vs_pc_metrics.csv", index=False)

display(pc_edges)
display(method_comparison)
run source edge_type target
0 pc_fisherz_linear_gaussian need --> match
1 pc_fisherz_linear_gaussian intent --> match
2 pc_fisherz_linear_gaussian intent --> renewal
3 pc_fisherz_linear_gaussian match --> engagement
4 pc_fisherz_linear_gaussian engagement --> renewal
5 pc_fisherz_linear_gaussian engagement --> support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bic_ges_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0
1 pc_fisherz_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0

In this controlled case, both methods recover the graph well. That agreement is useful: when methods with different search logic agree under plausible assumptions, the candidate structure is easier to trust.
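Agreement between two edge tables can be checked mechanically. One lightweight pattern, shown here with illustrative tuples rather than the notebook's actual tables, is plain set arithmetic over (source, edge_type, target) triples:

```python
# Illustrative edge tuples; in the notebook these would come from ges_edges and pc_edges.
ges_set = {("need", "-->", "match"), ("intent", "-->", "match"), ("engagement", "-->", "support")}
pc_set = {("need", "-->", "match"), ("intent", "-->", "match"), ("engagement", "-->", "support")}

agreed = ges_set & pc_set      # edges both methods found
only_ges = ges_set - pc_set    # edges unique to GES
only_pc = pc_set - ges_set     # edges unique to PC
print(len(agreed), len(only_ges), len(only_pc))
```

Reporting the disagreement sets, not just a headline accuracy number, makes it clear exactly where two discovery methods diverge.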

Plot GES Versus PC Metrics

A compact plot makes the method comparison easier to scan. We focus on correct directed edges, reversed true edges, unresolved true adjacencies, and extra adjacencies.

# Plot GES and PC metric comparison.
fig, ax = plt.subplots(figsize=(10, 5))
plot_df = method_comparison.melt(
    id_vars="run",
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.barplot(data=plot_df, x="metric", y="count", hue="run", ax=ax, palette=["#38bdf8", "#94a3b8"])
ax.set_title("BIC-GES Versus Fisher-Z PC On Linear Gaussian Data")
ax.set_xlabel("")
ax.set_ylabel("Count")
ax.tick_params(axis="x", rotation=20)
ax.legend(title="Run", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_ges_vs_pc_metrics.png", dpi=160, bbox_inches="tight")
plt.show()

The plot reinforces the friendly-case message. The next sections make the notebook more realistic by changing penalty strength, sample size, and data-generating assumptions.

BIC Penalty Sensitivity

The BIC penalty controls how much complexity is punished. A lower penalty can allow denser graphs; a higher penalty can prefer simpler graphs. This cell runs BIC-GES across several lambda_value choices.

# Run BIC-GES across penalty strengths.
lambda_values = [0.25, 0.5, 1.0, 2.0]
lambda_metric_rows = []
lambda_edge_rows = []
lambda_score_rows = []

for lambda_value in lambda_values:
    label = f"bic_ges_lambda_{lambda_value:g}"
    record, edge_table, messages = run_ges_bic(linear_df, label=label, lambda_value=lambda_value)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(lambda_value=lambda_value)
    lambda_metric_rows.append(metrics)
    lambda_edge_rows.append(edge_table.assign(lambda_value=lambda_value))
    lambda_score_rows.append(
        {
            "run": label,
            "lambda_value": lambda_value,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

lambda_metrics = pd.concat(lambda_metric_rows, ignore_index=True)
lambda_edges = pd.concat(lambda_edge_rows, ignore_index=True)
lambda_scores = pd.DataFrame(lambda_score_rows)

lambda_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_metrics.csv", index=False)
lambda_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_edges.csv", index=False)
lambda_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_scores.csv", index=False)

display(lambda_scores)
display(lambda_metrics)
display(lambda_edges)
run lambda_value score forward_updates backward_updates
0 bic_ges_lambda_0.25 0.25 -2524.523538 8 2
1 bic_ges_lambda_0.5 0.50 -2547.995676 8 2
2 bic_ges_lambda_1 1.00 -2608.870598 7 0
3 bic_ges_lambda_2 2.00 -2710.583196 7 0
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies lambda_value
0 bic_ges_lambda_0.25 6 6 6 6 1.00 1.000000 0 0 0 0 0.25
1 bic_ges_lambda_0.5 6 6 6 6 1.00 1.000000 0 0 0 0 0.50
2 bic_ges_lambda_1 7 4 6 1 0.25 0.166667 2 3 0 1 1.00
3 bic_ges_lambda_2 7 4 6 1 0.25 0.166667 2 3 0 1 2.00
run source edge_type target lambda_value
0 bic_ges_lambda_0.25 need --> match 0.25
1 bic_ges_lambda_0.25 intent --> match 0.25
2 bic_ges_lambda_0.25 intent --> renewal 0.25
3 bic_ges_lambda_0.25 match --> engagement 0.25
4 bic_ges_lambda_0.25 engagement --> renewal 0.25
5 bic_ges_lambda_0.25 engagement --> support 0.25
6 bic_ges_lambda_0.5 need --> match 0.50
7 bic_ges_lambda_0.5 intent --> match 0.50
8 bic_ges_lambda_0.5 intent --> renewal 0.50
9 bic_ges_lambda_0.5 match --> engagement 0.50
10 bic_ges_lambda_0.5 engagement --> renewal 0.50
11 bic_ges_lambda_0.5 engagement --> support 0.50
12 bic_ges_lambda_1 intent --> need 1.00
13 bic_ges_lambda_1 match --> need 1.00
14 bic_ges_lambda_1 intent --> match 1.00
15 bic_ges_lambda_1 intent --- renewal 1.00
16 bic_ges_lambda_1 engagement --> match 1.00
17 bic_ges_lambda_1 engagement --- renewal 1.00
18 bic_ges_lambda_1 engagement --- support 1.00
19 bic_ges_lambda_2 intent --> need 2.00
20 bic_ges_lambda_2 match --> need 2.00
21 bic_ges_lambda_2 intent --> match 2.00
22 bic_ges_lambda_2 intent --- renewal 2.00
23 bic_ges_lambda_2 engagement --> match 2.00
24 bic_ges_lambda_2 engagement --- renewal 2.00
25 bic_ges_lambda_2 engagement --- support 2.00

The penalty sensitivity table shows that the baseline penalty (0.5) and the weaker 0.25 setting recover the graph exactly, while the stronger penalties (1.0 and 2.0) reverse some edges, leave others unresolved, and add an extra adjacency. This is why score-based results should report tuning choices rather than presenting the graph as automatic truth.
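The mechanics of the penalty can be made concrete with a simplified Gaussian local score on a toy two-variable system. This is a sketch of the idea only, not causal-learn's exact BIC formula:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

def local_bic(child, parents, lam=1.0):
    # Simplified penalized Gaussian local score: log-likelihood minus lam * (k / 2) * log(n).
    # Sketches the role of the penalty coefficient; not causal-learn's implementation.
    n = len(child)
    X = np.column_stack([np.ones(n)] + list(parents))
    beta, *_ = np.linalg.lstsq(X, child, rcond=None)
    sigma2 = np.mean((child - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - lam * 0.5 * X.shape[1] * np.log(n)

# The true parent improves the score, and a larger lam shaves more off complex models.
print(local_bic(y, [x]) > local_bic(y, []))
print(local_bic(y, [x], lam=2.0) < local_bic(y, [x], lam=0.5))
```

When the signal is strong, as here, the edge survives any reasonable penalty; the sensitivity table above is interesting precisely where the fit gain and the penalty are closer in size.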

Plot Penalty Sensitivity

The plot tracks graph quality metrics across BIC penalty strengths. It helps students see when a score penalty starts changing the graph materially.

# Plot BIC penalty sensitivity.
fig, ax = plt.subplots(figsize=(10, 5))
lambda_plot = lambda_metrics.melt(
    id_vars=["run", "lambda_value"],
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.lineplot(data=lambda_plot, x="lambda_value", y="count", hue="metric", marker="o", ax=ax)
ax.set_title("BIC-GES Penalty Sensitivity")
ax.set_xlabel("BIC Penalty Coefficient")
ax.set_ylabel("Count")
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The sensitivity curve makes tuning visible. If a graph only appears for one penalty value, the report should treat it as less stable than a graph that persists across nearby choices.

Sample-Size Sensitivity

Score-based search can be unstable with small samples because local score differences are harder to estimate. This cell reruns BIC-GES on subsamples of increasing size.

# Run BIC-GES on increasing sample sizes.
sample_sizes = [100, 250, 500, 1000, len(linear_df)]
sample_metric_rows = []
sample_edge_rows = []
sample_score_rows = []

for sample_size in sample_sizes:
    sample_df = linear_df.sample(n=sample_size, random_state=908 + sample_size)
    label = f"bic_ges_n_{sample_size}"
    record, edge_table, messages = run_ges_bic(sample_df, label=label, lambda_value=0.5)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(sample_size=sample_size)
    sample_metric_rows.append(metrics)
    sample_edge_rows.append(edge_table.assign(sample_size=sample_size))
    sample_score_rows.append(
        {
            "run": label,
            "sample_size": sample_size,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

sample_metrics = pd.concat(sample_metric_rows, ignore_index=True)
sample_edges = pd.concat(sample_edge_rows, ignore_index=True)
sample_scores = pd.DataFrame(sample_score_rows)

sample_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_metrics.csv", index=False)
sample_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_edges.csv", index=False)
sample_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_scores.csv", index=False)

display(sample_scores)
display(sample_metrics)
display(sample_edges.head(20))
run sample_size score forward_updates backward_updates
0 bic_ges_n_100 100 -79.144994 7 0
1 bic_ges_n_250 250 -245.731976 8 2
2 bic_ges_n_500 500 -511.203702 8 2
3 bic_ges_n_1000 1000 -962.730373 8 2
4 bic_ges_n_2500 2500 -2547.995676 8 2
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies sample_size
0 bic_ges_n_100 7 2 6 2 1.0 0.333333 0 4 0 1 100
1 bic_ges_n_250 6 6 6 6 1.0 1.000000 0 0 0 0 250
2 bic_ges_n_500 6 6 6 6 1.0 1.000000 0 0 0 0 500
3 bic_ges_n_1000 6 6 6 6 1.0 1.000000 0 0 0 0 1000
4 bic_ges_n_2500 6 6 6 6 1.0 1.000000 0 0 0 0 2500
run source edge_type target sample_size
0 bic_ges_n_100 need --- intent 100
1 bic_ges_n_100 need --- match 100
2 bic_ges_n_100 intent --- match 100
3 bic_ges_n_100 intent --> renewal 100
4 bic_ges_n_100 match --- engagement 100
5 bic_ges_n_100 engagement --> renewal 100
6 bic_ges_n_100 engagement --- support 100
7 bic_ges_n_250 need --> match 250
8 bic_ges_n_250 intent --> match 250
9 bic_ges_n_250 intent --> renewal 250
10 bic_ges_n_250 match --> engagement 250
11 bic_ges_n_250 engagement --> renewal 250
12 bic_ges_n_250 engagement --> support 250
13 bic_ges_n_500 need --> match 500
14 bic_ges_n_500 intent --> match 500
15 bic_ges_n_500 intent --> renewal 500
16 bic_ges_n_500 match --> engagement 500
17 bic_ges_n_500 engagement --> renewal 500
18 bic_ges_n_500 engagement --> support 500
19 bic_ges_n_1000 need --> match 1000

Small samples can leave edges unoriented or add extra adjacencies. By a few hundred rows in this synthetic setup, BIC-GES becomes much more stable, which is exactly the kind of sample-size story a good report should include.
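The instability has a simple statistical root: the score gap that justifies an edge is itself an estimate. A toy illustration, using a hypothetical modest coefficient and a simplified BIC-style decision rule rather than the tutorial's data or causal-learn's score:

```python
import numpy as np

rng = np.random.default_rng(1)
x_full = rng.normal(size=2500)
y_full = 0.25 * x_full + rng.normal(size=2500)  # a real but modest edge x -> y

def prefers_edge(xs, ys):
    # Simplified Gaussian decision: does adding x as a parent of y beat the
    # empty model by more than a one-parameter BIC penalty?
    n = len(ys)
    fitted = np.polyval(np.polyfit(xs, ys, 1), xs)
    ll_gap = 0.5 * n * (np.log(np.var(ys)) - np.log(np.var(ys - fitted)))
    return ll_gap > 0.5 * np.log(n)

flips = {}
for n in (50, 1000):
    misses = 0
    for _ in range(200):
        idx = rng.choice(2500, size=n, replace=False)
        if not prefers_edge(x_full[idx], y_full[idx]):
            misses += 1
    flips[n] = misses

print(flips)  # small subsamples miss the true edge far more often
```

The same edge that is essentially always accepted at n=1000 is frequently rejected at n=50, which is the mechanism behind the unoriented and extra edges in the `bic_ges_n_100` run above.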

Plot Sample-Size Sensitivity

This plot makes the sample-size pattern easier to read. It tracks correct directed edges, unresolved true adjacencies, missing true adjacencies, and extra adjacencies as the available data grow.

# Plot sample-size sensitivity for BIC-GES.
fig, ax = plt.subplots(figsize=(10, 5))
sample_plot = sample_metrics.melt(
    id_vars=["run", "sample_size"],
    value_vars=["correct_directed_edges", "unresolved_true_adjacencies", "extra_adjacencies", "missing_true_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.lineplot(data=sample_plot, x="sample_size", y="count", hue="metric", marker="o", ax=ax)
ax.set_title("BIC-GES Sample-Size Sensitivity")
ax.set_xlabel("Sample Size")
ax.set_ylabel("Count")
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The sample-size plot is a useful diagnostic for any score-based discovery workflow. Stable graph recovery at larger samples is more reassuring than a graph that only appears in one small subsample.

Model-Mismatch Stress Test

BIC-GES is a linear Gaussian score. This section runs the same method on three datasets with different mechanisms. The purpose is not to say BIC-GES is bad; it is to show that score assumptions matter.

# Run BIC-GES on datasets with increasingly mismatched mechanisms.
regime_metric_rows = []
regime_edge_rows = []
regime_score_rows = []
for dataset_name in ["linear_gaussian", "linear_nongaussian", "nonlinear_continuous"]:
    label = f"bic_ges_{dataset_name}"
    record, edge_table, messages = run_ges_bic(datasets[dataset_name], label=label, lambda_value=0.5)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(dataset=dataset_name)
    regime_metric_rows.append(metrics)
    regime_edge_rows.append(edge_table.assign(dataset=dataset_name))
    regime_score_rows.append(
        {
            "run": label,
            "dataset": dataset_name,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

regime_metrics = pd.concat(regime_metric_rows, ignore_index=True)
regime_edges = pd.concat(regime_edge_rows, ignore_index=True)
regime_scores = pd.DataFrame(regime_score_rows)

regime_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_metrics.csv", index=False)
regime_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_edges.csv", index=False)
regime_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_scores.csv", index=False)

display(regime_scores)
display(regime_metrics)
display(regime_edges)
run dataset score forward_updates backward_updates
0 bic_ges_linear_gaussian linear_gaussian -2547.995676 8 2
1 bic_ges_linear_nongaussian linear_nongaussian -2561.286088 8 1
2 bic_ges_nonlinear_continuous nonlinear_continuous -4253.218605 9 0
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies dataset
0 bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 linear_gaussian
1 bic_ges_linear_nongaussian 7 7 6 6 0.857143 1.000000 0 0 0 1 linear_nongaussian
2 bic_ges_nonlinear_continuous 9 6 6 1 0.166667 0.166667 2 3 0 3 nonlinear_continuous
run source edge_type target dataset
0 bic_ges_linear_gaussian need --> match linear_gaussian
1 bic_ges_linear_gaussian intent --> match linear_gaussian
2 bic_ges_linear_gaussian intent --> renewal linear_gaussian
3 bic_ges_linear_gaussian match --> engagement linear_gaussian
4 bic_ges_linear_gaussian engagement --> renewal linear_gaussian
5 bic_ges_linear_gaussian engagement --> support linear_gaussian
6 bic_ges_linear_nongaussian need --> match linear_nongaussian
7 bic_ges_linear_nongaussian intent --> match linear_nongaussian
8 bic_ges_linear_nongaussian intent --> engagement linear_nongaussian
9 bic_ges_linear_nongaussian intent --> renewal linear_nongaussian
10 bic_ges_linear_nongaussian match --> engagement linear_nongaussian
11 bic_ges_linear_nongaussian engagement --> renewal linear_nongaussian
12 bic_ges_linear_nongaussian engagement --> support linear_nongaussian
13 bic_ges_nonlinear_continuous intent --> need nonlinear_continuous
14 bic_ges_nonlinear_continuous match --> need nonlinear_continuous
15 bic_ges_nonlinear_continuous engagement --> need nonlinear_continuous
16 bic_ges_nonlinear_continuous intent --> match nonlinear_continuous
17 bic_ges_nonlinear_continuous intent --- renewal nonlinear_continuous
18 bic_ges_nonlinear_continuous engagement --> match nonlinear_continuous
19 bic_ges_nonlinear_continuous renewal --> match nonlinear_continuous
20 bic_ges_nonlinear_continuous engagement --- renewal nonlinear_continuous
21 bic_ges_nonlinear_continuous engagement --- support nonlinear_continuous

The stress test shows the main limitation of a score: when the score is mismatched to the data-generating process, the discovered graph can deteriorate. The nonlinear dataset is especially hard for a linear Gaussian score.
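Why the linear Gaussian score struggles can be shown with a two-variable toy in which the dependence is purely nonlinear. The coefficients here are illustrative, not taken from the tutorial's data-generating process:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = x ** 2 + 0.1 * rng.normal(size=2000)  # purely nonlinear dependence

# A linear Gaussian score only "sees" linear structure, which is near zero here.
linear_r2 = np.corrcoef(x, y)[0, 1] ** 2

# A quadratic fit recovers almost all of the variance the linear view misses.
coeffs = np.polyfit(x, y, 2)
quadratic_r2 = 1 - np.var(y - np.polyval(coeffs, x)) / np.var(y)

print(round(linear_r2, 3), round(quadratic_r2, 3))
```

A strong dependence that is nearly invisible to a linear score is exactly the situation where BIC-GES can drop, reverse, or misplace edges, as the nonlinear row of the stress-test table shows.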

Plot Model-Mismatch Results

The plot compares graph recovery across the three mechanism regimes. It is a quick way to see where BIC-GES is reliable and where its assumptions start to fray.

# Plot the mechanism mismatch stress test.
fig, ax = plt.subplots(figsize=(10, 5))
regime_plot = regime_metrics.melt(
    id_vars=["run", "dataset"],
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.barplot(data=regime_plot, x="dataset", y="count", hue="metric", ax=ax)
ax.set_title("BIC-GES Under Mechanism Mismatch")
ax.set_xlabel("Dataset")
ax.set_ylabel("Count")
ax.tick_params(axis="x", rotation=15)
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()

The plot is the cautionary center of the notebook. Score-based discovery is only as credible as the score family and the data representation that support it.

Draw The Nonlinear Stress-Test Graph

The nonlinear graph is worth drawing because it shows what model mismatch looks like visually: extra arrows, reversed directions, and less faithful structure.

# Draw the BIC-GES result on nonlinear continuous data.
nonlinear_edges = regime_edges[regime_edges["dataset"] == "nonlinear_continuous"].drop(columns=["dataset"])
nonlinear_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_nonlinear_bic_ges_graph.png"
draw_box_graph(
    nonlinear_edges,
    title="BIC-GES On Nonlinear Continuous Data",
    path=nonlinear_graph_path,
    note="A linear Gaussian score can misread nonlinear dependence structure.",
)

The nonlinear graph is less trustworthy than the baseline graph. This is where a report should consider nonlinear scores, transformations, or a different discovery method rather than forcing BIC-GES to carry the whole analysis.

Discrete Score Example: BDeu-GES

BIC is not the only score. For discrete data, causal-learn supports BDeu-style scoring. This section runs GES on the discrete teaching dataset to show how the score family changes with data type.

# Run BDeu-GES on the discrete dataset.
discrete_df = datasets["discrete_mixed"].astype(int)
r_i_map = {i: int(discrete_df.iloc[:, i].nunique()) for i in range(discrete_df.shape[1])}
bdeu_parameters = {"sample_prior": 1, "structure_prior": 1, "r_i_map": r_i_map}

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    bdeu_record = ges(
        discrete_df.to_numpy(),
        score_func="local_score_BDeu",
        node_names=list(discrete_df.columns),
        parameters=bdeu_parameters,
    )

bdeu_edges = graph_to_edge_table(bdeu_record["G"], label="bdeu_ges_discrete_mixed")
bdeu_metrics = summarize_against_truth(bdeu_edges, discrete_true_edges, "bdeu_ges_discrete_mixed")
bdeu_classified = classify_edges(bdeu_edges, discrete_true_edges)
bdeu_score_summary = pd.DataFrame(
    [
        {
            "run": "bdeu_ges_discrete_mixed",
            "score": bdeu_record["score"],
            "forward_updates": len(bdeu_record.get("update1", [])),
            "backward_updates": len(bdeu_record.get("update2", [])),
            "sample_prior": bdeu_parameters["sample_prior"],
            "structure_prior": bdeu_parameters["structure_prior"],
        }
    ]
)

bdeu_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_edges.csv", index=False)
bdeu_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_metrics.csv", index=False)
bdeu_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_edge_classification.csv", index=False)
bdeu_score_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_score_summary.csv", index=False)

display(bdeu_score_summary)
display(bdeu_edges)
display(bdeu_metrics)
display(bdeu_classified)
run score forward_updates backward_updates sample_prior structure_prior
0 bdeu_ges_discrete_mixed -10499.860061 6 0 1 1
run source edge_type target
0 bdeu_ges_discrete_mixed need --- match
1 bdeu_ges_discrete_mixed match --> intent
2 bdeu_ges_discrete_mixed renewal --> intent
3 bdeu_ges_discrete_mixed match --- engagement
4 bdeu_ges_discrete_mixed engagement --- renewal
5 bdeu_ges_discrete_mixed engagement --- support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bdeu_ges_discrete_mixed 6 2 6 0 0.0 0.0 2 4 0 0
source edge_type target status
0 need --- match true adjacency with uncertain or wrong endpoint
1 match --> intent reversed true edge
2 renewal --> intent reversed true edge
3 match --- engagement true adjacency with uncertain or wrong endpoint
4 engagement --- renewal true adjacency with uncertain or wrong endpoint
5 engagement --- support true adjacency with uncertain or wrong endpoint

The BDeu example is intentionally more modest than the BIC baseline. It demonstrates how to switch score families for discrete data, and it shows that discretization and prior choices can make orientation recovery harder.
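At its core, BDeu is a Dirichlet-multinomial marginal likelihood over contingency counts. A compact sketch of a single-node BDeu score, stripped to essentials and not causal-learn's implementation (no structure prior, hypothetical toy data):

```python
import math
from collections import Counter

def bdeu_local(child, parent_cols, ess=1.0):
    # Simplified BDeu local score for one discrete node given its candidate parents.
    # ess is the equivalent sample size; q and r are parent-config and child-state counts.
    r = len(set(child))
    q = 1
    for col in parent_cols:
        q *= len(set(col))
    rows = list(zip(*parent_cols)) if parent_cols else [()] * len(child)
    joint = Counter(zip(rows, child))
    config = Counter(rows)
    a_j, a_jk = ess / q, ess / (q * r)
    score = 0.0
    for j in config:  # unobserved parent configs contribute zero and can be skipped
        score += math.lgamma(a_j) - math.lgamma(a_j + config[j])
        for k in set(child):
            score += math.lgamma(a_jk + joint[(j, k)]) - math.lgamma(a_jk)
    return score

# A child that copies its parent is scored much higher with that parent included.
x = [0, 1] * 100
y = list(x)
print(bdeu_local(y, [x]) > bdeu_local(y, []))
```

The `ess` argument plays the role of `sample_prior` in the causal-learn call above: it controls how strongly the Dirichlet prior smooths the observed counts.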

Draw The BDeu-GES Graph

The discrete graph is drawn with the same layout so it can be compared against the continuous BIC-GES graph. Undirected edges are shown as plain lines because GES returns an equivalence class (a CPDAG), in which some edge directions are genuinely left open.

# Draw the BDeu-GES graph on discrete data.
bdeu_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_graph.png"
draw_box_graph(
    bdeu_edges,
    title="BDeu-GES On Discrete Teaching Data",
    path=bdeu_graph_path,
    note="Discrete scoring matches categorical data, but orientation can remain uncertain.",
)

The BDeu graph is a good reminder that matching score type to data type is necessary but not sufficient. A score can be appropriate and still leave several edges uncertain.
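Reading those edge types out of a returned graph is mostly a matter of decoding the endpoint matrix. The sketch below uses a toy matrix, not the notebook's actual graph object; the `-1` tail / `1` arrowhead convention follows causal-learn's documented encoding, but verify it against your installed version:

```python
import numpy as np

# Endpoint matrix convention (per causal-learn's docs):
#   graph[i, j] == -1 and graph[j, i] == 1   =>  i --> j
#   graph[i, j] == graph[j, i] == -1         =>  i --- j (left undirected by the CPDAG)
names = ["need", "intent", "match"]
graph = np.array([
    [ 0, -1,  0],
    [ 1,  0, -1],
    [ 0, -1,  0],
])

def edges_from_matrix(graph, names):
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = graph[i, j], graph[j, i]
            if a == -1 and b == 1:
                edges.append((names[i], "-->", names[j]))
            elif a == 1 and b == -1:
                edges.append((names[j], "-->", names[i]))
            elif a == -1 and b == -1:
                edges.append((names[i], "---", names[j]))
    return edges

print(edges_from_matrix(graph, names))
```

This is essentially what the notebook's `graph_to_edge_table` helper does before the edge tables are classified against the ground truth.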

Score Function Runtime And Use Cases

The next table summarizes score functions you may encounter in causal-learn. The notebook runs the fast, stable choices by default and leaves the slower kernel/CV scores as optional experiments.

# Summarize score choices and approximate runtime expectations for this tutorial setup.
score_function_guide = pd.DataFrame(
    [
        {
            "score_function": "local_score_BIC",
            "data_type": "continuous",
            "main_assumption": "linear Gaussian local models",
            "run_by_default": True,
            "approx_runtime_on_this_dataset": "seconds",
            "when_to_try": "baseline score-based discovery for continuous data",
        },
        {
            "score_function": "local_score_BDeu",
            "data_type": "discrete",
            "main_assumption": "categorical Bayesian-network score with documented priors",
            "run_by_default": True,
            "approx_runtime_on_this_dataset": "seconds",
            "when_to_try": "discrete variables with a clear state space",
        },
        {
            "score_function": "local_score_CV_general",
            "data_type": "continuous",
            "main_assumption": "kernel regression with cross-validated likelihood",
            "run_by_default": False,
            "approx_runtime_on_this_dataset": "5-15+ minutes",
            "when_to_try": "nonlinear relationships when longer runtime is acceptable",
        },
        {
            "score_function": "local_score_marginal_general",
            "data_type": "continuous",
            "main_assumption": "kernel marginal likelihood style score",
            "run_by_default": False,
            "approx_runtime_on_this_dataset": "5-15+ minutes or more",
            "when_to_try": "nonlinear sensitivity checks on smaller samples first",
        },
    ]
)
score_function_guide.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_score_function_guide.csv", index=False)
display(score_function_guide)
score_function data_type main_assumption run_by_default approx_runtime_on_this_dataset when_to_try
0 local_score_BIC continuous linear Gaussian local models True seconds baseline score-based discovery for continuous ...
1 local_score_BDeu discrete categorical Bayesian-network score with docume... True seconds discrete variables with a clear state space
2 local_score_CV_general continuous kernel regression with cross-validated likelihood False 5-15+ minutes nonlinear relationships when longer runtime is...
3 local_score_marginal_general continuous kernel marginal likelihood style score False 5-15+ minutes or more nonlinear sensitivity checks on smaller sample...

This guide is included because score choice is both statistical and computational. A slower nonlinear score should usually be tested on a smaller sample or reduced variable set before it becomes part of a default notebook pipeline.

Combined Scenario Summary

This table gathers the main runs into one place: baseline BIC-GES, PC, selected penalty runs, sample-size runs, mechanism stress tests, and BDeu-GES. It becomes the compact summary for the notebook.

# Combine the most important metric tables into one scenario summary.
scenario_summary = pd.concat(
    [
        ges_metrics.assign(section="baseline_bic_ges"),
        pc_metrics.assign(section="constraint_based_comparison"),
        lambda_metrics.assign(section="bic_penalty_sensitivity"),
        sample_metrics.assign(section="sample_size_sensitivity"),
        regime_metrics.assign(section="mechanism_stress_test"),
        bdeu_metrics.assign(section="discrete_bdeu_example"),
    ],
    ignore_index=True,
    sort=False,
)
scenario_summary = scenario_summary[["section"] + [col for col in scenario_summary.columns if col != "section"]]
scenario_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_scenario_summary_metrics.csv", index=False)
display(scenario_summary)
section run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies lambda_value sample_size dataset
0 baseline_bic_ges bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN NaN
1 constraint_based_comparison pc_fisherz_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN NaN
2 bic_penalty_sensitivity bic_ges_lambda_0.25 6 6 6 6 1.000000 1.000000 0 0 0 0 0.25 NaN NaN
3 bic_penalty_sensitivity bic_ges_lambda_0.5 6 6 6 6 1.000000 1.000000 0 0 0 0 0.50 NaN NaN
4 bic_penalty_sensitivity bic_ges_lambda_1 7 4 6 1 0.250000 0.166667 2 3 0 1 1.00 NaN NaN
5 bic_penalty_sensitivity bic_ges_lambda_2 7 4 6 1 0.250000 0.166667 2 3 0 1 2.00 NaN NaN
6 sample_size_sensitivity bic_ges_n_100 7 2 6 2 1.000000 0.333333 0 4 0 1 NaN 100.0 NaN
7 sample_size_sensitivity bic_ges_n_250 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 250.0 NaN
8 sample_size_sensitivity bic_ges_n_500 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 500.0 NaN
9 sample_size_sensitivity bic_ges_n_1000 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 1000.0 NaN
10 sample_size_sensitivity bic_ges_n_2500 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 2500.0 NaN
11 mechanism_stress_test bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN linear_gaussian
12 mechanism_stress_test bic_ges_linear_nongaussian 7 7 6 6 0.857143 1.000000 0 0 0 1 NaN NaN linear_nongaussian
13 mechanism_stress_test bic_ges_nonlinear_continuous 9 6 6 1 0.166667 0.166667 2 3 0 3 NaN NaN nonlinear_continuous
14 discrete_bdeu_example bdeu_ges_discrete_mixed 6 2 6 0 0.000000 0.000000 2 4 0 0 NaN NaN NaN

The scenario summary is the main audit table. It shows where score-based discovery works cleanly and where the result becomes sensitive to penalty, sample size, or model mismatch.

GES Reporting Checklist

A score-based discovery report should document more than the final graph. The checklist below captures the key details that make a GES result easier to evaluate.

# Save a practical checklist for reporting score-based discovery results.
reporting_checklist = pd.DataFrame(
    [
        {
            "topic": "score choice",
            "question_to_answer": "Which score was used, and why does it match the data type and mechanism assumptions?",
            "reporting_note": "BIC, BDeu, and kernel scores answer different modeling questions.",
        },
        {
            "topic": "penalty or priors",
            "question_to_answer": "What penalty coefficient or score prior was used?",
            "reporting_note": "Graph density can change when the complexity penalty changes.",
        },
        {
            "topic": "search path",
            "question_to_answer": "How many forward and backward updates occurred?",
            "reporting_note": "The final graph is easier to debug when the search path is visible.",
        },
        {
            "topic": "sample-size sensitivity",
            "question_to_answer": "Does the graph stabilize as sample size increases?",
            "reporting_note": "Small-sample score differences can be noisy.",
        },
        {
            "topic": "model mismatch",
            "question_to_answer": "What happens when the same score is run on data with different mechanisms?",
            "reporting_note": "A score can be fast and elegant but still wrong for nonlinear data.",
        },
        {
            "topic": "claim strength",
            "question_to_answer": "Which edges are stable candidate structure rather than confirmed causal directions?",
            "reporting_note": "GES returns a discovered graph under assumptions, not a final causal proof.",
        },
    ]
)
reporting_checklist.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_reporting_checklist.csv", index=False)
display(reporting_checklist)

The checklist distills the notebook into reusable reporting practice. Score-based discovery is most useful when the score assumptions, tuning choices, and stability checks are all visible alongside the final graph.
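One way to put the checklist to work is to attach run-specific answers to it before sharing a report. The sketch below is a minimal, hypothetical example: the answers are illustrative placeholders for one GES run, not values computed by this notebook, and only a subset of the checklist topics is shown.

```python
import pandas as pd

# Hypothetical example: pair a few checklist topics with answers for one run.
# The answer strings are placeholders, not outputs of this notebook.
checklist = pd.DataFrame(
    {
        "topic": ["score choice", "penalty or priors", "search path"],
        "question_to_answer": [
            "Which score was used, and why does it match the data type?",
            "What penalty coefficient or score prior was used?",
            "How many forward and backward updates occurred?",
        ],
    }
)
run_answers = pd.DataFrame(
    {
        "topic": ["score choice", "penalty or priors", "search path"],
        "answer_for_this_run": [
            "BIC on the linear Gaussian dataset",
            "default BIC penalty coefficient",
            "see the saved update-history table",
        ],
    }
)

# Left-merge on topic so every checklist row keeps its question,
# with this run's answer alongside it.
answered = checklist.merge(run_answers, on="topic", how="left")
print(answered[["topic", "answer_for_this_run"]])
```

A left merge keeps the checklist as the source of truth: any topic without a recorded answer would surface as a missing value, which is itself a useful reporting signal.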

Artifact Manifest

The final code cell lists the generated figures and tables. This makes the notebook outputs easy to find later.

# Inventory artifacts generated by this notebook.
artifact_rows = []
for folder, artifact_type in [(TABLE_DIR, "table"), (FIGURE_DIR, "figure")]:
    for artifact_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": artifact_path.name,
                "relative_path": str(artifact_path.relative_to(NOTEBOOK_DIR)),
                "size_kb": round(artifact_path.stat().st_size / 1024, 1),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
artifact_manifest.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_artifact_manifest.csv", index=False)
display(artifact_manifest)
artifact_type file_name relative_path size_kb
0 table 08_baseline_ges_edge_classification.csv outputs/tables/08_baseline_ges_edge_classifica... 0.3
1 table 08_baseline_ges_edges.csv outputs/tables/08_baseline_ges_edges.csv 0.3
2 table 08_baseline_ges_messages.csv outputs/tables/08_baseline_ges_messages.csv 0.0
3 table 08_baseline_ges_metrics.csv outputs/tables/08_baseline_ges_metrics.csv 0.2
4 table 08_baseline_ges_score_summary.csv outputs/tables/08_baseline_ges_score_summary.csv 0.1
5 table 08_baseline_ges_update_history.csv outputs/tables/08_baseline_ges_update_history.csv 0.5
6 table 08_bdeu_ges_edge_classification.csv outputs/tables/08_bdeu_ges_edge_classification... 0.4
7 table 08_bdeu_ges_edges.csv outputs/tables/08_bdeu_ges_edges.csv 0.3
8 table 08_bdeu_ges_metrics.csv outputs/tables/08_bdeu_ges_metrics.csv 0.2
9 table 08_bdeu_ges_score_summary.csv outputs/tables/08_bdeu_ges_score_summary.csv 0.1
10 table 08_bic_lambda_sensitivity_edges.csv outputs/tables/08_bic_lambda_sensitivity_edges... 1.1
11 table 08_bic_lambda_sensitivity_metrics.csv outputs/tables/08_bic_lambda_sensitivity_metri... 0.4
12 table 08_bic_lambda_sensitivity_scores.csv outputs/tables/08_bic_lambda_sensitivity_score... 0.2
13 table 08_ges_concept_map.csv outputs/tables/08_ges_concept_map.csv 0.8
14 table 08_ges_reporting_checklist.csv outputs/tables/08_ges_reporting_checklist.csv 0.9
15 table 08_ges_vs_pc_metrics.csv outputs/tables/08_ges_vs_pc_metrics.csv 0.3
16 table 08_linear_gaussian_correlation_matrix.csv outputs/tables/08_linear_gaussian_correlation_... 0.7
17 table 08_linear_gaussian_data_audit.csv outputs/tables/08_linear_gaussian_data_audit.csv 0.6
18 table 08_loaded_dataset_summary.csv outputs/tables/08_loaded_dataset_summary.csv 0.2
19 table 08_mechanism_stress_test_edges.csv outputs/tables/08_mechanism_stress_test_edges.csv 1.4
20 table 08_mechanism_stress_test_metrics.csv outputs/tables/08_mechanism_stress_test_metric... 0.5
21 table 08_mechanism_stress_test_scores.csv outputs/tables/08_mechanism_stress_test_scores... 0.2
22 table 08_package_versions.csv outputs/tables/08_package_versions.csv 0.1
23 table 08_pc_baseline_edge_classification.csv outputs/tables/08_pc_baseline_edge_classificat... 0.3
24 table 08_pc_baseline_edges.csv outputs/tables/08_pc_baseline_edges.csv 0.3
25 table 08_pc_baseline_metrics.csv outputs/tables/08_pc_baseline_metrics.csv 0.3
26 table 08_sample_size_sensitivity_edges.csv outputs/tables/08_sample_size_sensitivity_edge... 1.2
27 table 08_sample_size_sensitivity_metrics.csv outputs/tables/08_sample_size_sensitivity_metr... 0.4
28 table 08_sample_size_sensitivity_scores.csv outputs/tables/08_sample_size_sensitivity_scor... 0.3
29 table 08_scenario_summary_metrics.csv outputs/tables/08_scenario_summary_metrics.csv 1.5
30 table 08_score_function_guide.csv outputs/tables/08_score_function_guide.csv 0.7
31 figure 08_baseline_ges_graph.png outputs/figures/08_baseline_ges_graph.png 66.2
32 figure 08_bdeu_ges_graph.png outputs/figures/08_bdeu_ges_graph.png 66.2
33 figure 08_bic_lambda_sensitivity.png outputs/figures/08_bic_lambda_sensitivity.png 83.6
34 figure 08_ges_vs_pc_metrics.png outputs/figures/08_ges_vs_pc_metrics.png 87.3
35 figure 08_linear_gaussian_correlation_heatmap.png outputs/figures/08_linear_gaussian_correlation... 93.8
36 figure 08_mechanism_stress_test.png outputs/figures/08_mechanism_stress_test.png 78.9
37 figure 08_nonlinear_bic_ges_graph.png outputs/figures/08_nonlinear_bic_ges_graph.png 77.1
38 figure 08_sample_size_sensitivity.png outputs/figures/08_sample_size_sensitivity.png 71.5
39 figure 08_true_dag.png outputs/figures/08_true_dag.png 67.8

This notebook now covers a complete score-based discovery workflow: BIC-GES, a PC comparison, search history, penalty sensitivity, sample-size sensitivity, mechanism mismatch, discrete BDeu scoring, runtime guidance, and reporting artifacts. The next tutorial can go deeper into exact search and score functions on smaller graphs.